Copyright © 2015, American-Eurasian Network for Scientific Information publisher
JOURNAL OF APPLIED SCIENCES RESEARCH
ISSN: 1819-544X  EISSN: 1816-157X
Journal home page: http://www.aensiweb.com/JASR
2015 October; 11(19): pages 14-19. Published Online 10 November 2015.

Research Article

A Survey on Secrecy Preserving Discovery of Subtle Statistic Contents

1A. Francis Thivya and 2P. Tharcis

1PG Scholar, Department of Computer Science and Engineering, Christian College of Engineering and Technology, Dindigul, Tamilnadu-624619, India.
2Assistant Professor, Department of Electronics and Communication Engineering, Christian College of Engineering and Technology, Dindigul, Tamilnadu-624619, India.

Received: 23 September 2015; Accepted: 25 October 2015

© 2015 AENSI PUBLISHER All rights reserved

ABSTRACT
A data distributor or software company gives sensitive data to a set of supposedly trusted agents (data owners). Sometimes that data is leaked and found in an unauthorized place. An organization may outsource its data after pre-processing, and the data may then be given to various clients. Data leakage happens every day when sensitive information is transferred to third-party agents; once the information is leaked, the companies involved face serious problems. Data leak detection is therefore an important part of managing and protecting confidential information. Data leaks on organization networks may also be caused by simple user mistakes. In existing solutions, the semi-honest detection provider lets the data owner detect sensitive data only at the cost of false-positive alarms, and keyword-based approaches do not cover enough sensitive data segments for reliable data leak detection. In this paper, we survey techniques for detecting data leaks and encryption algorithms for transferring data securely.

Keywords: Data leak detection provider, Fuzzy fingerprint, Host-based data leak detection, Semi-honest adversary.
INTRODUCTION

Information security has always played an important role; as technology has advanced, it has become one of the hottest topics of the last few decades. Information forensics and security covers the investigation and analysis techniques used to gather and preserve information from a particular computing device. Information security is the set of business processes that protects information assets, regardless of how the information is formatted or produced. It is essential for maintaining the availability, confidentiality and integrity of information technology systems and business data, and most large enterprises employ a dedicated security group to implement and manage the organization's information security program.

Detecting and preventing data leaks involves a set of solutions, including data confinement [6], stealthy malware detection, policy enforcement and data leak detection. Typical approaches to preventing data leaks fall under two categories [10]: i) host-based solutions and ii) network-based solutions. Network-based data leak detection focuses on analyzing unencrypted outbound network traffic through i) deep packet inspection and ii) information-theoretic analysis. Deep packet inspection examines every packet for occurrences of the sensitive data defined in a database, and generates an alert if sensitive data is found in the outgoing network traffic [1]. Network-based data leak detection complements the host-based approach. Host-based data leak detection typically involves i) encrypting data when it is not in use, ii) detecting stealthy malware by scanning the host with antivirus software, and iii) enforcing policies that restrict the transfer of sensitive data [10]. In the host-based approach, the data owner computes a set of digests from the sensitive data and reveals only a small amount to the Data Leak Detection (DLD) provider.
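The deep-packet-inspection idea above can be sketched as a plain substring scan over outbound payloads. This is a minimal illustration, not the cited systems' implementation; the pattern list and sample traffic are invented.

```python
# Minimal sketch of deep packet inspection for data leak detection:
# scan each outbound payload for known sensitive strings and record
# an alert for every occurrence. Patterns and payloads are illustrative.

SENSITIVE_PATTERNS = [b"SSN:123-45-6789", b"secret-project-plan"]

def inspect_packet(payload: bytes) -> list:
    """Return the sensitive patterns found in one packet payload."""
    return [p for p in SENSITIVE_PATTERNS if p in payload]

def scan_traffic(packets: list) -> list:
    """Scan a stream of payloads; return (packet index, matched pattern) alerts."""
    alerts = []
    for i, payload in enumerate(packets):
        for hit in inspect_packet(payload):
            alerts.append((i, hit))
    return alerts

traffic = [b"GET /index.html", b"POST body SSN:123-45-6789 end"]
print(scan_traffic(traffic))  # alert raised for packet 1
```

In a real inspector the pattern set would be large and the scan would use multi-pattern matching rather than a Python `in` loop; the alerting structure is the same.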
The DLD provider computes fingerprints from network traffic and identifies potential leaks.

(Corresponding author: P. Tharcis, Assistant Professor, Department of Electronics and Communication Engineering, Christian College of Engineering and Technology, Dindigul, Tamilnadu-624619, India.)

The collection of candidate leaks consists of real leaks and noise [1], [15]. The data owner post-processes the potential leaks sent by the DLD provider to determine whether any real leak occurred. The DLD provider obtains only digests of the sensitive data. The data owner uses the Rabin fingerprint algorithm with a sliding window to generate one-way digests through fast polynomial modulus operations (short and hard to reverse) [10], [13]. To detect data leaks while preserving privacy, the data owner generates a set of digests called fuzzy fingerprints, which hide the sensitive data in the network traffic [10]. This introduces a matching step on the DLD provider's side without increasing false-positive alarms for the data owner: the DLD provider detects leaks by range-based comparison rather than exact matching. The DLD provider is modeled as a semi-honest adversary [1]: it performs the inspection of network traffic as agreed, but may attempt to learn information about the sensitive data provided by the data owner.

Fig. 1: Flow diagram for data leak detection model

Fig. 1 illustrates the flow of the data leak detection model. In the DLD model, every data owner has an individual username and password. When a data owner wants to transfer data to a receiver, the owner first pre-processes the data to remove noise and make it consistent. Fuzzy fingerprints are generated to hide the sensitive data in the crowd (network traffic), and the SHA algorithm is used to produce a hash value of the document.
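The sliding-window fingerprinting and range-based (fuzzy) comparison described above can be sketched as follows. This is a hedged illustration: the window size, polynomial base, modulus and fuzzy mask are invented parameters, not the ones used in the cited work.

```python
# Sketch of sliding-window polynomial fingerprints plus fuzzification.
# BASE/MOD/WINDOW/FUZZY_MASK are illustrative choices, not the paper's.
BASE, MOD, WINDOW = 257, (1 << 61) - 1, 8   # polynomial base, modulus, window size
FUZZY_MASK = ~0xFF   # range comparison: ignore the low 8 bits of a fingerprint

def fingerprints(data: bytes) -> set:
    """Rolling Rabin-style fingerprints of every WINDOW-byte substring."""
    if len(data) < WINDOW:
        return set()
    h, top = 0, pow(BASE, WINDOW - 1, MOD)
    fps = set()
    for i, b in enumerate(data):
        h = (h * BASE + b) % MOD                    # shift window right by one byte
        if i >= WINDOW - 1:
            fps.add(h)
            h = (h - data[i - WINDOW + 1] * top) % MOD  # drop the leftmost byte
    return fps

def fuzzy(fps: set) -> set:
    """Fuzzify digests so the DLD provider matches by range, not exactly."""
    return {f & FUZZY_MASK for f in fps}

sensitive = b"account=4111-1111-1111-1111"
traffic   = b"...leak: account=4111-1111-1111-1111 ..."
leaked = fuzzy(fingerprints(sensitive)) & fuzzy(fingerprints(traffic))
print(bool(leaked))  # True: candidate leak reported back to the data owner
```

Because each fuzzy fingerprint covers a range of raw digests, the provider's matches include deliberate noise; the data owner later filters candidates against the exact fingerprint set it kept private.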
The data is transferred through the DLD provider, which identifies potential data leaks: the provider recomputes hash values over the traffic and compares them with the original hash values. The data owner then confirms each potential leak reported by the DLD provider by comparing its fuzzy value against the original fuzzy fingerprint set. To reduce exposure, the shortest path with minimum traffic is used to transfer data from the data owner to the receiver.

The remainder of this paper is organized as follows: Section II describes the related work, and Section III summarizes the conclusion.

II. Related Work:

A number of cryptographic primitives provide network-based privacy preservation methods; we give an overview of i) data leak detection methods and ii) secure keyword search methods below.

A. Data Leak Detection Methods:

1. Decoy Information Technique:
A malicious insider can steal confidential data of a cloud user despite the provider's precautions: i) disallowing physical access, ii) a zero-tolerance policy for insiders who access the data storage, and iii) logging all access to the service and using internal audits to find malicious insiders [7]. This technique uses decoy information: bogus (fake) data that can be generated on demand and serves as a means of detecting unauthorized access. It is combined with user behavior profiling to secure user information in the cloud; based on the user's activity, the cloud server sends either the original or the decoy file. The decoys serve two purposes: i) validating whether data access is authorized when abnormal activity is detected, and ii) confusing attackers with bogus information. The approach detects abnormal user behavior and protects the original file, but it cannot fulfill the privacy demanded by common cloud computing services.
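The decoy idea above can be sketched as a small decision rule: a behavior-profiling score decides whether the real file or a decoy is served. The profile, scoring rule and threshold here are invented for illustration, not those of the cited system.

```python
# Illustrative sketch of decoy-based insider defense: an anomaly score
# derived from user behavior profiling decides whether the real file or
# a decoy is served. Baseline, scoring and threshold are assumptions.

NORMAL_PROFILE = {"avg_downloads_per_hour": 5}
ANOMALY_THRESHOLD = 3.0

def anomaly_score(downloads_last_hour: int) -> float:
    """Ratio of observed activity to the user's historical baseline."""
    return downloads_last_hour / NORMAL_PROFILE["avg_downloads_per_hour"]

def serve_file(user_downloads: int, real: str, decoy: str) -> str:
    """Serve the decoy (and implicitly flag the access) when behavior looks abnormal."""
    if anomaly_score(user_downloads) > ANOMALY_THRESHOLD:
        return decoy   # bogus content confuses the suspected insider
    return real

print(serve_file(4, "payroll.xlsx", "decoy_payroll.xlsx"))   # normal activity
print(serve_file(40, "payroll.xlsx", "decoy_payroll.xlsx"))  # suspicious burst
```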
2. Fake Object Technique:
Fake objects are small changes to the original data that improve the chance of finding guilty agents [2]. The company may add fake objects to the distributed data in order to improve the distributor's effectiveness in detecting guilty agents. The idea of perturbing or altering data to detect leaks is not new, and it can identify unauthorized use of data; the main disadvantage is that it alters the original data.

3. Secure Login Technique:
Data distribution can be arranged so that data mining yields better decisions and support [4]. Secure data integration means that no sensitive data is revealed during private data mining; the Secure Hashing Algorithm (SHA) provides a more secure and private data mining model. This method addresses the problem of releasing private data when different datasets covering the same set of users are held by two parties. For storage and encryption, once data is uploaded to the web server, the cloud server divides it into many portions and stores them on separate servers; the data integration logic and source code are stored on the data server. A data admin is the person who integrates the data on the web server; before integration, the admin has to register on the server, and once the data holder/admin is registered, the data is encrypted. The main purpose of this technique is secure data integration, i.e., different keys are assigned to enforce authentication over private data; however, it cannot recover the original data and may produce collisions.

4. Privacy Shield Method:
This is a simple database query application with two parties: a server in possession of a database and a client performing disjunctive equality queries. Privacy Shield enables a client and server to exchange data without leaking more than the required data [6]. An Isolated Box (IB), which can be implemented as part of secure hardware, is a non-colluding, untrusted party connected to both the server and the client.
This method explores privacy-preserving sharing of sensitive information, providing a concrete and efficient construction modeled in the context of simple database queries. The techniques are secure and efficient; the test database has 45 searchable attributes and one unsearchable attribute.

5. MapReduce Technique:
MapReduce is a programming framework for distributed data-intensive applications and has been used to solve security problems such as spam filtering, Internet traffic analysis and log analysis. Private multiparty set intersection methods exist, but their high computational overhead is a concern for time-sensitive security applications such as data leak detection [5]. MapReduce has the ability to scale and to utilize public resources for the task. In this method, MapReduce nodes scan content in data storage or network transmission for leaks without learning what the sensitive data is. The DLD provider receives both the sensitive fingerprint collections and the content, and a two-phase MapReduce algorithm computes the intersection of each content/sensitive-collection pair; its output determines whether the sensitive data has leaked. The method supports secure outsourcing of data leak detection to untrusted MapReduce and cloud providers. Its main drawback is that a significant portion of incidents are caused by unintentional mistakes of data owners.

6. Revolver Technique:
Revolver uses efficient techniques to identify resemblances between large numbers of JavaScript programs and to automatically interpret their differences to detect evasion. It leverages the observation that two similar scripts should be classified the same way by web malware detectors [9]. If Revolver finds similarities between a malicious script and a benign one, it performs an additional classification step. Revolver classifies similarities into four possible categories: evasions, injections, data dependencies and general evasions.
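A very simplified stand-in for Revolver's script comparison can be built with token-sequence similarity. Revolver compares real ASTs; the crude word-boundary tokenizer and `difflib` ratio below are illustrative shortcuts, not its actual algorithm.

```python
# Simplified stand-in for Revolver-style script similarity: normalize
# two scripts into token sequences and compare them. Revolver operates
# on real ASTs; tokenizing on word boundaries is an illustrative shortcut.
import difflib
import re

def tokens(script: str) -> list:
    """Crude lexer: split a script into identifier/symbol tokens."""
    return re.findall(r"[A-Za-z_]\w*|\S", script)

def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1]; near-identical scripts score close to 1."""
    return difflib.SequenceMatcher(None, tokens(a), tokens(b)).ratio()

benign  = "var x = unescape(payload); eval(x);"
evasive = "var x = unescape(payload); if (!debugger_present) eval(x);"
print(round(similarity(benign, evasive), 2))  # high score flags the pair for review
```

A high score between a detected (malicious) script and an undetected one is exactly the situation in which the additional classification step described above is triggered.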
This technique finds similarities between JavaScript files. Revolver analyzes the Abstract Syntax Tree (AST) of each script instead of examining web pages as a whole. It performs three refinement operations: i) individual ASTs are extracted from the web pages obtained from the oracle, ii) the detection status is determined based on the page classification, and iii) each node in an AST is recorded to determine whether the corresponding statement was executed.

7. Data Leak Detection Using RSA Algorithm:
This approach uses encrypted objects to determine which agent leaked the data. Encryption encodes information so that it can be read by authorized parties but not by unauthorized ones; if an unauthorized agent receives the data, encryption keeps it unreadable and thereby increases confidentiality [8]. A notification is automatically sent to the distributor identifying the agent that leaked the particular set of records. The approach protects data from unauthorized users and may reduce leakage through the RSA algorithm, an asymmetric key encryption algorithm, used here for fake object creation. The important drawback is that in symmetric encryption a single key is used for both encryption and decryption, so key security becomes a problem.

8. Black Box Differencing:
This method detects information leakage by determining whether a process depends on private data: the process is duplicated and the copies are given different inputs, and the two copies are then compared; in the absence of a private data leak, their runtime behavior matches [3], [12]. It provides a network-wide method of confining and assuring the flow of sensitive data within the network, detecting information leaks by forking copies of processes that consume private data and removing the sensitive data from the input of one copy. Black box differencing is used to mitigate leakage of private data.
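The black-box differencing idea can be sketched as running the same routine twice, once on the real input and once on a scrubbed copy, and flagging output divergence. The record fields and the stand-in request builders below are invented for illustration.

```python
# Sketch of black-box differencing: run two copies of the same routine,
# one on the real input and one with the private field scrubbed; if the
# network-visible output differs, the output depends on the private data.
# `build_request` stands in for an arbitrary application under test.

def scrub(record: dict) -> dict:
    """Replace the private field with same-size filler, so the copy does
    not crash or diverge merely because of an unexpected input size."""
    fake = dict(record)
    fake["ssn"] = "X" * len(record["ssn"])
    return fake

def leaks_private_data(build_request, record: dict) -> bool:
    """True when the observable output depends on the private field."""
    return build_request(record) != build_request(scrub(record))

leaky  = lambda r: "GET /track?user=%s&s=%s" % (r["name"], r["ssn"])
benign = lambda r: "GET /track?user=%s" % r["name"]

record = {"name": "alice", "ssn": "123-45-6789"}
print(leaks_private_data(leaky, record))   # True: the ssn reaches the wire
print(leaks_private_data(benign, record))  # False
```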
It uses scrubbed data of the same size to prevent odd behavior or crashes in applications due to unexpected input sizes. The disadvantages of this approach are that it cannot monitor encrypted network traffic without the encryption keys, and that scrubbing sensitive data to limit false positives can corrupt applications that receive the scrubbed input.

9. Fuzzy Fingerprint Data Leak Detection Method:
This approach focuses on analyzing unencrypted outbound network traffic [1], [11] through i) deep packet inspection and ii) information-theoretic analysis. In deep packet inspection, every packet is examined to identify occurrences of sensitive data. The main characteristic of the fuzzy fingerprint data leak detection model is that detection does not require the data owner to expose the content of the stored sensitive data. The fuzzy fingerprint algorithm can detect accidental data leaks due to human error and/or application flaws, and it hides the sensitive data in the crowd. It provides a quantified method to measure the privacy guarantee and gives accurate data leak detection with a small number of false-positive alarms, but it is not efficient enough for practical data leak inspection and suffers from computational complexity and realization difficulty.

10. Agent Guilt Model:
Data leaks may be found in unauthorized places in an organization and may or may not be observed by the data owner. A data watcher model is used to identify data leaks, while the guilt model captures leakage where agents can conspire and identify fake tuples; it improves the probability of identifying guilty third parties [10], [13], [16]. Adding fake objects to the distributed set acts as a type of watermark for the entire set without modifying any individual member. If an agent was given one or more fake objects that were later leaked, the distributor can be confident that the agent was guilty.
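The fake-object attribution step can be sketched directly: each agent's copy is salted with fake tuples unique to that agent, so fake tuples found in a leak point back to the guilty agent. All records and agent names below are invented.

```python
# Sketch of the guilt-model idea: each agent's copy of the data set is
# salted with a fake tuple unique to that agent, so fake tuples found
# in a leaked set identify the guilty agent. All records are invented.

real_data = {"r1", "r2", "r3"}
fake_objects = {            # one distinctive fake tuple per agent
    "agent_a": "fake_a",
    "agent_b": "fake_b",
}

# Each agent receives the real data plus its own fake tuple.
distributed = {agent: real_data | {fake} for agent, fake in fake_objects.items()}

def guilty_agents(leak: set) -> list:
    """Agents whose planted fake tuples appear in the leaked set."""
    return sorted(a for a, f in fake_objects.items() if f in leak)

leaked = {"r1", "r3", "fake_b"}
print(guilty_agents(leaked))  # ['agent_b']
```

A leak containing only real tuples yields no suspect, which matches the stated drawback that some leaks go unobserved or unattributed.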
This improves the distributor's chance of identifying a leaker or guilty agent. The important drawback is that a data leak may or may not be observed by the data owner.

11. Watermarking Technique:
Data leak detection can also be handled by watermarking, e.g., embedding a unique code in each distributed copy. If that copy is later found with an unauthorized person, the leaker can be identified [14], [15]. Watermarks can be very useful but involve some modification of the original data. For example, a company may have partnerships with other companies that require sharing customer data, or may outsource its data processing so that data must be given to various other companies. There are two major disadvantages: watermarking modifies the data, and watermarks can sometimes be destroyed if the receiver is malicious.

B. Secure Keyword Search Methods:

1. Latent Semantic Analysis (LSA) Technique:
LSA is a technique in natural language processing that analyzes the relationships between a set of documents and the terms they contain. A statistical technique is used to estimate the latent semantic structure and remove the obscuring noise. The scheme places similar items near each other and returns to the data user the files containing terms latent-semantically associated with the query keyword [17]. LSA reveals the semantic structure between terms and documents by extracting a set of correlated indexing variables or factors. A k-nearest-neighbor technique encrypts the index and the query vector, so that exact ranked results are obtained while the confidentiality of the data is protected. The disadvantage of this approach is its computational and time complexity.
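The flavor of latent-semantic retrieval can be illustrated with a toy stand-in. Real LSA factorizes the term-document matrix with SVD; the sketch below only uses term co-occurrence to expand a query so that documents sharing related (not identical) terms are still returned. The corpus is invented.

```python
# Toy stand-in for the LSA idea: instead of an SVD of the term-document
# matrix, expand the query with co-occurring terms so that documents
# containing semantically associated terms are returned and ranked.
from collections import defaultdict
from itertools import combinations

docs = {
    "d1": {"cloud", "storage", "encryption"},
    "d2": {"cloud", "server"},
    "d3": {"encryption", "key"},
}

# Build a term co-occurrence map from the corpus.
cooc = defaultdict(set)
for terms in docs.values():
    for a, b in combinations(terms, 2):
        cooc[a].add(b)
        cooc[b].add(a)

def search(keyword: str) -> list:
    """Rank documents containing the keyword or a co-occurring term."""
    expanded = {keyword} | cooc[keyword]
    scored = ((len(terms & expanded), name) for name, terms in docs.items())
    return [name for score, name in sorted(scored, reverse=True) if score > 0]

print(search("encryption"))  # d1 ranks first: it shares the most related terms
```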
2. Reputation Mechanism:
The reputation mechanism is used to evaluate the trustworthiness of each reporter [18]. Its key idea is a reputation table that maintains a reputation score for the Spam Reports (SR) of each reporter, based on that reporter's previous reliability record. Each reported spam is given a suspicion score derived from the reporter's score in the SR table. To tolerate malicious error reports and achieve a near-zero false-positive rate, the mechanism increases a reputation score carefully but drops it drastically when a false-positive error emerges. It is sufficiently reliable to classify subsequent near-duplicate e-mails as spam; even if phrases are rephrased after some time and the features are not very distinctive, the system can still correctly recognize a message as spam.

3. Wild-Card Based Technique:
This technique allows users to search for all files whose names contain similar qualities. It solves the problem of effective fuzzy keyword search over encrypted data while preserving keyword privacy [19], using a gram-based technique for matching: a fuzzy keyword set is constructed from grams. For example, the deletion-based grams of CASTLE (each obtained by removing one character) are {ASTLE, CSTLE, CATLE, CASLE, CASTE, CASTL}. The technique provides security and privacy for keywords in cloud storage, but the resulting search experience is frustrating and system efficiency is very low (the system produces many conjunctions while searching for a keyword).

Conclusion:
Many types of network-based data leak detection techniques are discussed above, each with drawbacks. To overcome the drawbacks of the fuzzy fingerprint data leak detection model, a host-based Data Leak Detection (DLD) method is proposed. Its main feature is that detection does not expose the content of the sensitive data, enabling the data owner to safely delegate the detection; only a small amount of specialized digests is needed.
This host-based DLD technique may improve the accuracy, privacy, concision and efficiency of fuzzy fingerprint data leak detection.

References

1. Danfeng Yao, Elisa Bertino and Xiaokui Shu, 2015. "Privacy-Preserving Detection of Sensitive Data Exposure", IEEE Transactions on Information Forensics and Security, 10(5).
2. Amod Panchal, Grinal Tuscano, Humairah Kotadiya, Rollan Fernandes and Vikrant Bhat, 2015. "A Survey on Data Leakage Detection", International Journal of Engineering Research and Applications, 5(4): 153-158.
3. Jung, J., A. Sheth, B. Greenstein, D. Wetherall, G. Maganis and T. Kohno, 2008. "Privacy Oracle: A system for finding application leaks with black box differential testing", in Proc. 15th ACM Conf. Comput. Commun. Secur., pp. 279-288.
4. Brintha Rajakumari, S., M. Mohamed Badruddin and Qasim Uddin, 2015. "Secure Login of Statistical Data with Two Parties", International Journal on Recent and Innovation Trends in Computing and Communication, 3(3).
5. Butt, A.R., F. Liu, X. Shu and D. Yao, "Privacy-preserving scanning of big content for sensitive data exposure with MapReduce", in Proc. ACM CODASPY.
6. Emiliano De Cristofaro, Gene Tsudik and Yanbin Lu, 2015. "Efficient Techniques for Privacy-Preserving Sharing of Sensitive Information", IEEE Transactions on Information Forensics and Security.
7. Ameya Bhorkar, Tejas Bagade, Pratik Patil and Sumit Somani, 2015. "Preclusion of Insider Data Theft Attacks in the Cloud", International Journal of Advanced Research in Computer and Communication Engineering, 4(1).
8. Supriya Singh, 2013. "Data Leakage Detection Using RSA Algorithm", International Journal of Application or Innovation in Engineering & Management (IJAIEM), 2(5).
9. Cova, M., A. Kapravelos, C. Kruegel, Y. Shoshitaishvili and G. Vigna, 2013. "Revolver: An automated approach to the detection of evasive web-based malware", in Proc. 22nd USENIX Secur. Symp., pp. 637-652.
10. Janga Ajay Kumar and K. Rajani Devi, 2012. "An Efficient and Robust Model for Data Leakage Detection System", Journal of Global Research in Computer Science, 3(6).
11. Shu, X. and D. Yao, 2012. "Data leak detection as a service", in Proc. 8th Int. Conf. Secur. Privacy Commun. Netw., pp. 222-240.
12. Croft, J. and M. Caesar, 2011. "Towards practical avoidance of information leakage in enterprise networks", in Proc. 6th USENIX Conf. Hot Topics Secur., p. 7.
13. Archana Vaidya, Kiran More, Nivedita Pandey, Prakash Lahange and Shefali Kachroo, 2013. "Data Leakage Detection", International Journal of Advances in Engineering & Technology, 3(1): 315-321.
14. Kulkarni, S.V. and Sandip A. Kale, 2012. "Data Leakage Detection", International Journal of Advanced Research in Computer and Communication Engineering, 1(9): 668-678.
15. Chandni Bhatt and Richa Sharma, 2014. "Data Leakage Detection", International Journal of Computer Science and Information Technologies, 5: 2556-2558.
16. Anusha Koneru, G. Siva Nageswara Rao and J. Venkata Rao, 2013. "Data Leakage Detection Using Encrypted Fake Objects", International Journal of P2P Network Trends and Technology, 3(2): 104-110.
17. Li Chen, Qi Liu, Xingming Sun and Zhihua Xia, 2014. "An Efficient and Privacy-Preserving Semantic Multi-Keyword Ranked Search over Encrypted Cloud Data", International Journal of Security and Its Applications, 8(2): 323-332.
18. Latha, K., G. Menagagandhi, P. Nivedha and T. Ramya, 2014. "Collaborative Anti Spam Technique to Detect Spam Mails in E-Mails", International Journal of Scientific and Research Publications, 4(3): 1-5.
19. Cao, N., N. Li, W. Lou, K. Ren, C. Wang and Q. Wang, 2010. "Fuzzy keyword search over encrypted data in cloud computing", in Proc. IEEE INFOCOM, pp. 1-5.