Copyright © 2015, American-Eurasian Network for Scientific Information publisher
JOURNAL OF APPLIED SCIENCES RESEARCH
ISSN: 1819-544X
EISSN: 1816-157X
JOURNAL home page: http://www.aensiweb.com/JASR
2015 October; 11(19): pages 14-19.
Published Online 10 November 2015.
Research Article
A Survey on Secrecy Preserving Discovery of Subtle Statistic Contents
1A. Francis Thivya and 2P. Tharcis
1PG Scholar, Department of Computer Science and Engineering, Christian College of Engineering and Technology, Dindigul, Tamilnadu-624619, India.
2Assistant Professor, Department of Electronics and Communication Engineering, Christian College of Engineering and Technology, Dindigul, Tamilnadu-624619, India.
Received: 23 September 2015; Accepted: 25 October 2015
© 2015 AENSI Publisher. All rights reserved.
ABSTRACT
A data distributor or software company gives sensitive data to a set of trusted data owners. Sometimes this data is leaked and found in an unauthorized place. An organization may outsource its data after pre-processing, and the data may then be given to various clients. Data leaks happen every day when sensitive information is transferred to third-party agents; when this information leaks out, the companies involved face serious problems. Data leak detection is therefore an important part of managing and protecting confidential information. Data leaks on organization networks may also be caused by mistakes of users. Existing solutions allow a semi-honest data owner to detect sensitive data only with false-positive alarms, and keyword-based methods do not cover enough sensitive data segments for data leak detection. In this paper, we survey techniques for detecting data leaks and encryption algorithms for transferring data securely.
Keywords: Data leak detection provider, Fuzzy fingerprint, Host-based data leak detection, Semi-honest adversary.
INTRODUCTION
Information security has always had an important role, and as technology has advanced it has become one of the hottest topics of the last few decades. Information forensics and security is the investigation and analysis of techniques used to gather and preserve information from a particular computing device. Information security is the set of business processes that protects information assets, regardless of how the information is formatted or produced.
Information security is very important for maintaining the availability, confidentiality and integrity of information technology systems and business data. Most large enterprises employ a dedicated security group to implement and manage the organization's information security program. Detecting and preventing data leaks involves a set of solutions including data confinement [6], stealthy malware detection, policy enforcement and data leak detection. Typical approaches to preventing data leaks fall under two categories [10]: i) host-based solutions and ii) network-based solutions.
Network-based data leak detection focuses on analyzing unencrypted outbound network traffic through i) deep packet inspection and ii) information-theoretic analysis. Deep packet inspection examines every packet for occurrences of the sensitive data defined in a database; this solution generates an alert if sensitive data is found in the outgoing network traffic [1].
Network-based data leak detection complements the host-based approach. Host-based data leak detection typically involves i) encrypting data when it is not in use, ii) detecting stealthy malware by antivirus scanning of the host and iii) enforcing policies that restrict the transfer of sensitive data [10].
In the host-based data leak detection approach, the data owner computes a set of digests from the sensitive data and reveals only a small amount of it to the Data Leak Detection (DLD) provider. The DLD provider computes fingerprints from the network traffic and identifies the potential leaks. The collection of potential leaks consists of real leaks and noise [1], [15]. The DLD provider sends the potential leaks to the data owner, who post-processes them to determine whether any real leak exists; the DLD provider itself obtains only digests of the sensitive data. The data owner uses the Rabin fingerprint algorithm and a sliding window to generate one-way digests through fast polynomial modulus operations (i.e., short and hard-to-reverse) [10], [13].
Corresponding Author: P. Tharcis, Assistant Professor, Department of Electronics and Communication Engineering, Christian College of Engineering and Technology, Dindigul, Tamilnadu-624619, India.
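The sliding-window digest step can be sketched as a Rabin-style rolling polynomial hash. This is a minimal illustration only: the window size, base and modulus below are chosen for the example, not taken from any particular scheme.

```python
PRIME = 2**31 - 1   # modulus for the polynomial hash (illustrative)
BASE = 257          # polynomial base (illustrative)

def rolling_fingerprints(data: bytes, window: int):
    """Yield a polynomial fingerprint for every window-sized substring."""
    if len(data) < window:
        return
    high = pow(BASE, window - 1, PRIME)  # weight of the byte leaving the window
    fp = 0
    for b in data[:window]:
        fp = (fp * BASE + b) % PRIME
    yield fp
    for i in range(window, len(data)):
        # slide: drop the oldest byte, append the newest
        fp = ((fp - data[i - window] * high) * BASE + data[i]) % PRIME
        yield fp

# The owner fingerprints the sensitive data; any traffic containing an
# 8-byte run of that data will share at least one digest with it.
sensitive = b"account=4111-1111"
digests = set(rolling_fingerprints(sensitive, 8))
traffic = b"GET /?q=account=4111-1111 HTTP/1.1"
leaked = any(fp in digests for fp in rolling_fingerprints(traffic, 8))
```

Because the hash rolls one byte at a time, every window-sized substring of the traffic is fingerprinted in a single pass.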
To detect data leaks while preserving privacy, the data owner generates a set of digests called fuzzy fingerprints. Fuzzy fingerprints are used to hide the sensitive data within the network traffic [10]. This scheme introduces additional matching instances on the DLD provider's side without increasing false-positive alarms for the data owner. The DLD provider detects leaks by range-based comparison rather than exact matching. The DLD provider is a semi-honest adversary [1]: it performs the inspection on network traffic, but it may also attempt to learn information about the sensitive data provided by the data owner.
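The range-based comparison can be sketched as follows, assuming purely for illustration that fuzzification hides the low-order bits of each digest; the bit width and the example fingerprints are invented for this sketch.

```python
FUZZY_BITS = 6                       # low-order bits hidden from the provider
MASK = ~((1 << FUZZY_BITS) - 1)

def fuzzify(fingerprint: int) -> int:
    # every fingerprint in the same 2**FUZZY_BITS range maps to one value
    return fingerprint & MASK

def provider_match(traffic_fp: int, fuzzy_set) -> bool:
    # provider-side range comparison: no exact sensitive digest is revealed
    return fuzzify(traffic_fp) in fuzzy_set

def owner_postprocess(candidates, exact_set):
    # owner-side: keep real leaks, discard noise from the fuzzy range
    return [fp for fp in candidates if fp in exact_set]

exact = {0b1010110101, 0b0011001100}            # owner's true digests
fuzzy = {fuzzify(fp) for fp in exact}           # what the provider sees
observed = [0b1010110111, 0b1111111111]         # fingerprints from traffic
hits = [fp for fp in observed if provider_match(fp, fuzzy)]
real = owner_postprocess(hits, exact)           # the coincidental hit is dropped
```

Note how the provider flags a coincidental match inside the fuzzy range, which the owner's post-processing then discards; this is why fuzzification hides the data without raising the owner's false-positive rate.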
Fig. 1: Flow diagram for data leak detection model
Fig. 1 illustrates the flow of the data leak detection model. In the DLD model, every data owner has an individual username and password. If a data owner wants to transfer data to a receiver, the owner must first perform a pre-processing operation to remove noise and make the data consistent. Fuzzy fingerprints are generated to hide the sensitive data in the crowd (network traffic). The SHA algorithm is used to produce the hash value of the document. The data is transferred through the DLD provider, which identifies potential data leaks: the provider regenerates the hash value of the data and compares it with the original hash value. The data owner then examines the potential leaks sent by the DLD provider by comparing the fuzzy values with the original fuzzy fingerprint set. To avoid data leaks in transit, a shortest path with minimum traffic is used to transfer data from the data owner to the receiver.
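The hash-comparison step of this flow can be sketched as follows; the document content and function names are illustrative.

```python
import hashlib

def sha_digest(document: bytes) -> str:
    # SHA-256 digest of the document, shared alongside the transfer
    return hashlib.sha256(document).hexdigest()

original = b"quarterly revenue report (confidential)"
sent_digest = sha_digest(original)

# the DLD provider recomputes the digest over the data it observes
received = b"quarterly revenue report (confidential)"
intact = sha_digest(received) == sent_digest  # a mismatch would flag tampering
```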
The remainder of this paper is organized as follows: Section II describes the related work and Section III summarizes the conclusion.
II. Related Work:
A number of cryptographic primitives provide network-based privacy preservation methods: i) data leak detection methods and ii) secure keyword search methods. We overview them below.
A. Data Leak Detection Methods:
1. Decoy Information Technique:
A malicious insider can steal confidential data of the cloud user despite the provider taking precautions: i) disallowing physical access, ii) maintaining a zero-tolerance policy for insiders that access the data storage and iii) logging all access to the service and using internal audits to find malicious insiders [7].
This technique uses decoy information. A decoy is bogus (fake) information that can be generated on demand and serves as a means of detecting unauthorized access to information. The technology is combined with user behavior profiling to secure user information in the cloud; based on the user's activity, the cloud server sends either the original or the decoy file.
The decoys serve two purposes: i) validating whether data access is authorized when abnormal access is detected and ii) confusing the attacker with bogus information. The method detects anomalous user behavior and protects the original file, but it cannot fulfill the privacy demanded by common cloud computing services.
2. Fake Object Technique:
Fake objects are small changes to the original data that improve the chance of finding guilty agents [2]. The company may add fake objects to the distributed data in order to improve the distributor's effectiveness in detecting guilty agents. The idea of perturbing or altering data to detect leaks is not new, and it can identify unauthorized use of data. The main disadvantage is that it alters the original data.
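The fake-object idea can be sketched as follows; the record formats, agent names and the way a fake record is marked are all invented for the illustration.

```python
def distribute(data, agents):
    # give every agent the real data plus one unique fabricated record
    copies = {}
    for agent in agents:
        copies[agent] = data + [f"fake-record-for-{agent}"]
    return copies

def guilty_agents(leaked, copies):
    # an agent is implicated if its unique fake record shows up in the leak
    leaked_set = set(leaked)
    return [agent for agent, copy in copies.items()
            if any(r.startswith("fake-record-for-") and r in leaked_set
                   for r in copy)]

copies = distribute(["alice,555-0100", "bob,555-0101"], ["agent1", "agent2"])
suspects = guilty_agents(copies["agent2"], copies)   # agent2 leaks its copy
```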
3. Secure Login Technique:
Data distribution can be manipulated for better data mining, leading to better decisions and support [4]. Data integration here means that no sensitive data can be revealed during private data mining. The Secure Hashing Algorithm (SHA) provides a more secure and private data mining model. This method addresses the problem of releasing private data when different datasets containing the same set of users are held by two parties.
In data storage and encryption, once data is stored on the web server, the cloud server divides the data into many portions and stores them on separate servers. Data integration and source code are stored in the data server. The data admin is the person who integrates the data on the web server. Before integration, the admin has to register on the server; once the data holder/admin is registered, the data is encrypted.
The main purpose of this technique is to provide data integration, i.e., a different type of key is assigned to determine authentication over the private data. However, it cannot recover the unique data and may produce collisions.
4. Privacy Shield Method:
It is a simple database query application with two parties: a server in possession of a database and a client performing disjunctive equality queries. Privacy Shield enables a client and server to exchange data without leaking more than the required data [6]. An Isolated Box (IB) can be implemented as part of secure hardware; the IB is a non-colluding, untrusted party connected to both the server and the client. This method explores privacy-preserving sharing of sensitive information.
It provides a concrete and efficient construction modeled in the context of simple database queries, with secure and efficient techniques to protect the information. The test database has 45 searchable attributes and 1 unsearchable attribute.
5. MapReduce Technique:
MapReduce is a programming framework for distributed data-intensive applications. It has been used to solve security problems such as spam filtering, Internet traffic analysis and log analysis. Although private multiparty set intersection methods exist, their high computational overhead is a concern for time-sensitive security applications such as data leak detection [5].
MapReduce has the ability to scale and to utilize public resources for the task. In this method, MapReduce nodes scan content in data storage or network transmission for leaks without learning what the sensitive data is. The DLD provider receives both the sensitive fingerprint collections and the content, and runs a two-phase MapReduce algorithm: by computing the intersection of each content and sensitive-collection pair, its output determines whether the sensitive data is leaked.
It supports secure outsourcing of data leak detection to untrusted MapReduce and cloud providers. The main drawback is that a significant portion of incidents is caused by unintentional mistakes of data owners.
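A simplified, single-machine sketch of the two-phase detection follows; a real deployment would distribute the map and reduce phases across MapReduce nodes, and the fingerprints and origin labels here are illustrative.

```python
from collections import defaultdict

def map_phase(records):
    # map: emit a (fingerprint, origin) pair for every record
    for origin, fp in records:
        yield fp, origin

def reduce_phase(mapped):
    # reduce: group by fingerprint; a leak is a fingerprint that appears
    # in both the sensitive collection and the scanned content
    groups = defaultdict(set)
    for fp, origin in mapped:
        groups[fp].add(origin)
    return {fp for fp, origins in groups.items()
            if {"sensitive", "content"} <= origins}

records = [("sensitive", 0xA1), ("sensitive", 0xB2),
           ("content", 0xB2), ("content", 0xC3)]
leaks = reduce_phase(map_phase(records))   # only 0xB2 appears in both
```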
6. Revolver Technique:
Revolver uses efficient techniques to identify similarities between large numbers of JavaScript programs and to automatically interpret their differences in order to detect evasion. It leverages the observation that two scripts which are similar should be classified the same way by web malware detectors [9].
If Revolver finds similarities between a malicious script and a benign script, it performs an additional classification step. Revolver classifies similarities into four possible categories: evasions, injections, data dependencies and general evolutions. The technique is used to find similarities between JavaScript files.
Revolver analyses the Abstract Syntax Tree (AST) of each script instead of examining web pages as a whole. It performs refinement operations: i) individual ASTs are extracted from the web pages obtained from the oracle, ii) the detection status is determined based on the page classification and iii) each node in an AST is recorded to determine whether the corresponding statement was executed.
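Revolver operates on JavaScript ASTs; purely to illustrate the similarity idea, the sketch below reduces two Python scripts to their AST node-type sequences with the standard ast module and scores their overlap. A near-but-not-exact match is the kind of pair Revolver would inspect further.

```python
import ast
from difflib import SequenceMatcher

def node_sequence(source: str):
    # flatten a script into the sequence of its AST node types
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]

def similarity(a: str, b: str) -> float:
    # ratio of matching node types between the two sequences, in [0, 1]
    return SequenceMatcher(None, node_sequence(a), node_sequence(b)).ratio()

benign = "x = 1\nif x > 0:\n    run()\n"
evasive = "x = 1\nif x > 0 and not detected():\n    run()\n"
score = similarity(benign, evasive)   # high, but below 1.0: candidate evasion
```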
7. Data Leak Detection Using RSA Algorithm:
This approach uses encrypted objects to determine the agent that leaked the data. In encryption, the information is encoded so that it can be read by authorized parties and not by unauthorized ones; if an unauthorized agent receives the data, encryption keeps it unreadable and thus increases confidentiality [8]. A notification is automatically sent to the distributor identifying the agent that leaked the particular set of records.
It protects the data from unauthorized users and may reduce leakage through the RSA algorithm, an asymmetric-key encryption algorithm, used here together with fake object creation. The important drawback is that in symmetric encryption a single key is used for both encryption and decryption, so key security becomes a problem.
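The asymmetric encryption this approach relies on can be illustrated with a toy RSA key pair; the primes below are far too small for real use and serve only to show the encrypt/decrypt round trip.

```python
# toy key generation with tiny primes (no real security)
p, q = 61, 53
n = p * q                     # public modulus
phi = (p - 1) * (q - 1)
e = 17                        # public exponent, coprime with phi
d = pow(e, -1, phi)           # private exponent (modular inverse, Python 3.8+)

def encrypt(m: int) -> int:
    return pow(m, e, n)       # anyone may encrypt with the public key (e, n)

def decrypt(c: int) -> int:
    return pow(c, d, n)       # only the private key holder can decrypt

message = 65
cipher = encrypt(message)     # unreadable to an unauthorized agent
recovered = decrypt(cipher)   # the authorized receiver gets the message back
```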
8. Black Box Differencing:
Black box differencing detects information leakage by determining whether a process depends on private data: the process is duplicated and the two copies are given different inputs. The behavior of the two copies is then compared; in the absence of private data leakage, their runtime behavior matches [3], [12].
It provides a network-wide method of confining and assuring the flow of sensitive data within the network. This network-based method detects information leaks by forking copies of processes consuming private data and removing the sensitive data from the input of one copy.
Black box differencing is used to mitigate leakage of private data. It uses replacement data of the same size to prevent odd behavior or crashes in applications due to unexpected input sizes. The disadvantage of this approach is that it cannot monitor encrypted network traffic without the encryption keys, and the sensitive data must be scrubbed carefully to limit false positives and possible corruption of applications receiving scrubbed input.
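The duplicate-and-compare step can be sketched as follows; the "process" here is an illustrative function standing in for the real application under test.

```python
def process(profile):
    # an illustrative leaky application: echoes the private SSN to output
    return f"user={profile['name']} ssn={profile['ssn']}"

def leaks_private_data(proc, profile, private_key="ssn"):
    scrubbed = dict(profile)
    # same-size replacement avoids crashes caused by unexpected input sizes
    scrubbed[private_key] = "0" * len(profile[private_key])
    # differing observable output means the output depends on the private field
    return proc(profile) != proc(scrubbed)

leaky = leaks_private_data(process, {"name": "alice", "ssn": "123456789"})
```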
9. Fuzzy fingerprint Data leak detection method:
This approach focuses on analyzing unencrypted outbound network traffic [1], [11] by i) deep packet inspection and ii) information-theoretic analysis. In deep packet inspection, every packet is examined to identify occurrences of sensitive data. This is also referred to as the fuzzy fingerprint data leak detection model; its main characteristic is that detection does not require the data owner to expose the content of the stored sensitive data. The fuzzy fingerprint algorithm can detect accidental data leaks due to human error and/or application flaws, and it also hides the sensitive data in the crowd.
It provides a quantified method to measure the privacy guarantee and accurate data leak detection with a small number of false-positive alarms. However, it is not efficient enough for practical data leak inspection, and it suffers from computational complexity and realization difficulty.
10. Agent Guilt Model:
A data leak may be found in an unauthorized place in the organization; it is either observed or not observed by the data owner. A data watcher model is used to identify data leaks, and a guilt model captures leakage where agents can conspire and identify fake tuples. The guilt model improves the possibility of identifying guilty third parties [10], [13], [16].
Adding fake objects to the distributed set acts as a type of watermark for the entire set, without modifying any individual member. If an agent was given one or more fake objects that were leaked, then the distributor can be confident that the agent was guilty.
This improves the distributor's chance of identifying a leaker or guilty agent. The important drawback is that a data leak may or may not be observed by the data owner.
11. Watermarking Technique:
Data leak detection is also addressed by watermarking; e.g., a unique code is embedded in each distributed copy. If that copy is later found with an unauthorized person, the leaker can be identified [14], [15]. Watermarks can be very useful but involve some modification of the original data. For example, a company may have partnerships with other companies that require sharing customers' data, or a company may outsource its data processing, so data must be given to various other companies.
There are two major disadvantages: watermarking involves some modification of the data, and watermarks can sometimes be destroyed if the receiver is malicious.
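The per-copy watermark idea can be sketched as follows; embedding the code as an appended tag is deliberately naive (real watermarks are hidden inside the data itself), and the receiver names are invented.

```python
import hashlib

def watermark(document, receiver):
    # derive a short unique code per receiver and embed it in the copy
    code = hashlib.sha256(receiver.encode()).hexdigest()[:8]
    return document + f"\n<!-- wm:{code} -->"

def identify_leaker(leaked_copy, receivers):
    # a leaked copy carries exactly one receiver's embedded code
    for r in receivers:
        if watermark("", r).strip() in leaked_copy:
            return r
    return None

receivers = ["partner-a", "partner-b"]
copies = {r: watermark("customer dataset v1", r) for r in receivers}
leaker = identify_leaker(copies["partner-b"], receivers)
```

If a malicious receiver strips the tag, identification fails, which is exactly the second disadvantage noted above.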
B. Secure Keyword Search Methods:
1. Latent Semantic Analysis (LSA) Technique:
Latent Semantic Analysis is a technique in natural language processing that analyzes the relationships between a set of documents and the terms they contain. A statistical technique is used to estimate the latent semantic structure and remove the obscuring noise. The scheme tries to place similar items near each other and returns to the data user the files containing terms latent-semantically associated with the query keyword [17].
LSA finds the semantic structure between terms and documents by extracting a set of correlated indexing variables or factors. A k-nearest-neighbor technique encrypts the index and the query vector so that exact ranked results are obtained while the confidentiality of the data is defended.
The scheme reveals the relationships between terms and documents and uses a secure KNN construction to achieve secure search functionality. The disadvantage of this approach is its computational and time complexity.
2. Reputation Mechanism:
The reputation mechanism is used to evaluate the credibility of each reporter [18]. The key idea of the reputation mechanism is to use a reputation table to maintain a reputation score, the Spam Report (SR), for each reporter based on that reporter's previous reliability record. Each submitted spam report is given a suspicion score equal to the reporter's SR. To handle malicious error reports and achieve a near-zero false-positive rate, the mechanism increases the reputation score carefully but drops it drastically when a false-positive error emerges.
It is sufficiently reliable to classify subsequent near-duplicate e-mails as spam. Even if phrases are rephrased after some time, provided the features do not change too much, the system would still correctly recognize the message as spam.
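The score update can be sketched as follows; the exact step sizes are illustrative, keeping only the stated shape (increase carefully, drop drastically).

```python
def update_reputation(score, report_was_correct):
    if report_was_correct:
        return min(1.0, score + 0.05)   # increase the score carefully
    return max(0.0, score * 0.25)       # drop it drastically on a false positive

score = 0.5
for outcome in [True, True, True, False, True]:
    score = update_reputation(score, outcome)
# one false positive erases several correct reports' worth of reputation
```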
3. Wild-Card Based Technique:
It allows users to search for all files whose names contain similar qualities. It solves the problem of effective fuzzy keyword search over encrypted data while preserving keyword privacy [19]. A gram-based technique is used for matching: a fuzzy keyword set is constructed from the grams of each keyword. For example, the grams of CASTLE obtained by deleting one character, together with the keyword itself, form the set {CASTLE, ASTLE, CSTLE, CATLE, CASLE, CASTE, CASTL}.
It provides security and privacy for keywords in cloud storage. However, the resulting searching experience is frustrating and system efficiency is very low (i.e., the system produces very many conjunctions while searching for the keyword).
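The deletion-gram construction can be sketched as follows; the matching rule (a non-empty intersection of gram sets) is a simplified stand-in for the scheme's encrypted index lookup.

```python
def gram_set(keyword):
    # the keyword plus every string formed by deleting one character
    grams = {keyword}
    for i in range(len(keyword)):
        grams.add(keyword[:i] + keyword[i + 1:])
    return grams

castle = gram_set("CASTLE")
# a query with one dropped letter still shares a gram with the stored set
match = not gram_set("CASLE").isdisjoint(castle)
```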
Conclusion:
Many types of network-based data leak detection techniques are discussed above, each with some drawbacks. To overcome the drawbacks of the fuzzy fingerprint data leak detection model, a host-based Data Leak Detection (DLD) method is proposed. Its main feature is that detection does not expose the content of the sensitive data and enables the data owner to safely delegate the detection: only a small amount of specialized digests is needed. This host-based DLD technique may improve the accuracy, privacy, concision and efficiency of fuzzy fingerprint data leak detection.
References
1. Danfeng Yao, Elisa Bertino and Xiaokui Shu, 2015. "Privacy-Preserving Detection of Sensitive Data Exposure", IEEE Transactions on Information Forensics and Security, 10(5).
2. Amod Panchal, Grinal Tuscano, Humairah Kotadiya, Rollan Fernandes and Vikrant Bhat, 2015. "A Survey On Data Leakage Detection", Int. Journal of Engineering Research and Applications, 5(4): 153-158.
3. Jung, J., A. Sheth, B. Greenstein, D. Wetherall, G. Maganis and T. Kohno, 2008. "Privacy Oracle: A system for finding application leaks with black box differential testing", in Proc. 15th ACM Conf. Comput. Commun. Secur., pp: 279-288.
4. Brintha Rajakumari, S., M. Mohamed Badruddin and Qasim Uddin, 2015. "Secure Login of Statistical Data with Two Parties", International Journal on Recent and Innovation Trends in Computing and Communication, 3(3).
5. Butt, A.R., F. Liu, X. Shu and D. Yao, "Privacy-preserving scanning of big content for sensitive data exposure with MapReduce", in Proc. ACM CODASPY.
6. Emiliano De Cristofaro, Gene Tsudik and Yanbin Lu, 2015. "Efficient Techniques for Privacy-Preserving Sharing of Sensitive Information", IEEE Transactions on Information Forensics and Security.
7. Ameya Bhorkar, Tejas Bagade, Pratik Patil and Sumit Somani, 2015. "Preclusion of Insider Data Theft Attacks in the Cloud", International Journal of Advanced Research in Computer and Communication Engineering, 4(1).
8. Supriya Singh, 2013. "Data Leakage Detection Using RSA Algorithm", International Journal of Application or Innovation in Engineering & Management (IJAIEM), 2(5).
9. Cova, M., A. Kapravelos, C. Kruegel, Y. Shoshitaishvili and G. Vigna, 2013. "Revolver: An automated approach to the detection of evasive web-based malware", in Proc. 22nd USENIX Secur. Symp., pp: 637-652.
10. Janga Ajay Kumar and K. Rajani Devi, 2012. "An Efficient and Robust Model for Data Leakage Detection System", Journal of Global Research in Computer Science, 3(6).
11. Shu, X. and D. Yao, 2012. "Data leak detection as a service", in Proc. 8th Int. Conf. Secur. Privacy Commun. Netw., pp: 222-240.
12. Caesar, J. and M. Croft, 2011. "Towards practical avoidance of information leakage in enterprise networks", in Proc. 6th USENIX Conf. Hot Topics Secur., p: 7.
13. Archana Vaidya, Kiran More, Nivedita Pandey, Prakash Lahange and Shefali Kachroo, 2013. "Data Leakage Detection", International Journal of Advances in Engineering & Technology, 3(1): 315-321.
14. Kulkarni, S.V. and Sandip A. Kale, 2012. "Data Leakage Detection", International Journal of Advanced Research in Computer and Communication Engineering, 1(9): 668-678.
15. Chandni Bhatt and Richa Sharma, 2014. "Data Leakage Detection", International Journal of Computer Science and Information Technologies, 5: 2556-2558.
16. Anusha Koneru, G. Siva Nageswara Rao and J. Venkata Rao, 2013. "Data Leakage Detection Using Encrypted Fake Objects", International Journal of P2P Network Trends and Technology, 3(2): 104-110.
17. Li Chen, Qi Liu, Xingming Sun and Zhihua Xia, 2014. "An Efficient and Privacy-Preserving Semantic Multi-Keyword Ranked Search over Encrypted Cloud Data", International Journal of Security and Its Applications, 8(2): 323-332.
18. Latha, K., G. Menagagandhi, P. Nivedha and T. Ramya, 2014. "Collaborative Anti Spam Technique to Detect Spam Mails in E-Mails", Int. Journal of Scientific and Research Publications, 4(3): 1-5.
19. Cao, N., N. Li, W. Lou, K. Ren, C. Wang and Q. Wang, 2010. "Fuzzy keyword search over encrypted data in cloud computing", IEEE, pp: 1-5.