An Efficient Distortion Minimizing Technique for Watermarking Relational Databases

G. Shyamala#1, C. Kanimozhi*2, S. P. Kavya#3
# Assistant Professor, Computer Science & Engg., Sri Shakthi Institute of Engg. & Tech., Coimbatore.
1 shyamala@siet.ac.in 2 kanimozhi@siet.ac.in 3 spkavya@siet.ac.in

Abstract— Ownership protection is an important issue at present. Intelligent mining techniques must be used on data extracted from relational databases to detect interesting patterns (generally hidden in the data) that provide significant support to decision makers in making effective, accurate and relevant decisions; as a result, sharing of data between its owners and data mining experts (or corporations) is increasing significantly. This work focuses mainly on decoding accuracy, from which the watermarking efficiency is calculated. The MD5 technique is used in both encoding and decoding, which enables the decoder to achieve maximum accuracy. The experimental results show its efficiency in terms of decoding accuracy.

Keywords— Data usability constraints, database watermarking, data quality, right protection, ownership protection, Reed–Solomon.

I. INTRODUCTION

Database technology has been used with great success in traditional business data processing. There is an increasing desire to use this technology in new application domains. One such application domain that is likely to acquire considerable significance in the near future is database mining. An increasing number of organizations are creating ultra-large databases (measured in gigabytes and even terabytes) of business data, such as consumer data, transaction histories and sales records; such data forms a potential gold mine of valuable business information.

In general, database watermarking techniques consist of two phases: watermark embedding and watermark verification. During the watermark embedding phase, a private key K (known only to the owner) is used to embed the watermark W into the original database.
The watermarked database is then made publicly available. To verify the ownership of a suspicious database, the verification process is performed: the suspicious database is taken as input and, using the private key K (the same key used during the embedding phase), the embedded watermark (if present) is extracted and compared with the original watermark information. Digital watermarks for relational databases are potentially useful in many applications, including ownership assertion, fingerprinting, and fraud and tamper detection [11].

An intended recipient (Bob) wants the data owner (Alice) to define tight usability constraints so that he gets accurate data. For maximum robustness of the watermark, Alice, on the other hand, wants a larger bandwidth for the manipulations performed during embedding of a watermark, which is only possible if she sets soft usability constraints [1], [2]. To conclude, Bob and Alice have conflicting requirements: Bob wants "minimum distortions in the watermarked data" while Alice wants to produce "watermarked data having strong ownership". Any watermark embedding technique that strives for a compromise bandwidth allows an attacker (Mallory) to corrupt or remove the watermark by slightly surpassing the available bandwidth. The compromise bandwidth is achieved once Alice defines the usability constraints in such a way that the embedded watermark is not only robust but also causes minimum distortion to the underlying data. The job of analyzing the semantics of each application and using it to define usability constraints is not only cumbersome but also inefficient for a data owner. Remember, the robustness of a watermark is measured by the watermark decoding accuracy, which in turn depends on the bandwidth available for manipulation [12].

In the existing system, a statistical-based algorithm was proposed in which a database is partitioned into a maximum number of unique, non-intersecting subsets of tuples.
The data partitioning concept is based on the use of special marker tuples, making it vulnerable to watermark synchronization errors, particularly in the case of tuple insertion and deletion attacks, since the position of the marker tuples is disturbed by these attacks. Such errors may be reduced if the marker tuples are stored during the watermark embedding phase and reused to reconstruct the data partitions during the watermark decoding phase. But using stored marker tuples to reconstruct the partitions violates the requirement of "blind decoding" of the watermark. Furthermore, the threshold technique for bit decoding involves arbitrarily chosen thresholds – without following any optimality criteria – that are responsible for the error in the decoding process. The concept of usability bounds on data is used in this technique to control the distortions introduced in the data during watermark embedding. However, an attacker can corrupt the watermark by launching large-scale attacks on a large number of rows. Moreover, the decoding accuracy depends on the usability bounds set by the data owner; as a result, the decoding accuracy deteriorates if an attacker violates these bounds. An important shortcoming of this approach is that the data owner needs to specify usability constraints separately for every type of application that will use the data [3]. This prior work thus has several disadvantages: high distortion, limited decoding accuracy, and insufficient resolution of ownership conflicts over the watermarked dataset in the case of additive attacks.

This paper proposes a novel watermark decoding algorithm that is independent of the usability constraints (or available bandwidth). As a result, our approach allows Alice to define the usability constraints only once for a particular database, for every possible type of intended application.
Moreover, it also ensures that the watermark introduces the least possible distortion to the original data without compromising the robustness of the inserted watermark. The proposed algorithm embeds every bit of a multi-bit watermark (generated from a datetime) in each selected row (in a numeric attribute), with the objective of maximum robustness even if an attacker somehow manages to corrupt the watermark in some selected part of the dataset. The proposed system also proves the robustness of our watermarking scheme by analyzing its decoding accuracy under different types of malicious attacks using a real-world dataset. It also provides a solution to resolve conflicting ownership issues in the case of the additive attack, in which Mallory inserts his own watermark in Alice's watermarked database.

II. LITERATURE REVIEW

Decision Tree classification (DT): A decision tree, more properly a classification tree, is mostly used to learn a classification model which predicts the value of a dependent attribute (variable) given the values of the independent (input) attributes (variables). This solves a problem known as supervised classification, since the dependent attribute and the number of classes (values) it may take are given [4]. In decision tree structures, leaves signify classifications and branches signify the combinations of characteristics that lead to those classifications.

Support Vector Machine (SVM) classification: The aim of SVMs is to learn a model which predicts the class label of instances in the testing set. This classification algorithm is one of the most robust classifiers for two-class classification. SVM can handle both linear and nonlinear classification problems. For linearly separable problems, SVM classifiers simply search for a hyperplane that separates the negative and positive instances [5].

K-Nearest Neighbour classification (KNN): The K-nearest neighbour classifier is one kind of supervised classification technique.
KNN, presented by Devijver and Kittler (1982) [6], commonly uses the Euclidean distance measure. For each row of the test set, the k nearest (in Euclidean distance) training set vectors are found, and the classification is decided by majority vote, with ties broken at random. If there are ties for the kth nearest vector, all candidates are included in the vote.

Digital watermarking is the technique of hiding a small amount of digital data in a digital signal in such a way that it cannot be detected by a standard playback device or viewer. In digital watermarking, an indelible and invisible message is embedded into both the image and the audio track of a motion picture as it passes through the server. According to R. Agrawal, piracy of digital assets such as images, video, audio and text can be countered by inserting a digital watermark into the data, thus providing a promising way to protect digital data from illicit copying and manipulation [5]. After embedding the watermark, the data and the watermark are inseparable [7]. The security of relational databases has been a great concern since the expanded use of these data over the Internet, because digital data allows an unlimited number of copies of an "original" without any quality loss and can also be easily distributed and forged [7]. Hence, digital watermarking for relational databases has emerged as a candidate solution to provide copyright protection, tamper detection, traitor tracing and integrity maintenance for relational data [8].

III. PROPOSED WORK

The proposed work is divided into preprocessing, watermark bits generation, data partitioning, selection of the dataset for watermarking, watermark embedding, watermark decoding and, finally, majority voting. In preprocessing, the dataset, which contains numerical and image data, is given for preprocessing. Preprocessing is a vital step which removes unwanted data from the dataset and noise present in the images.
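The k-nearest-neighbour rule summarized in the literature review above can be sketched as follows (an illustrative toy example; the training points and labels are invented, and the classifier is background material rather than part of the watermarking scheme itself):

```python
import math
import random
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training vectors nearest to `query`
    in Euclidean distance; ties are broken at random."""
    nearest = sorted(train, key=lambda vl: math.dist(vl[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    top = max(votes.values())
    winners = [label for label, n in votes.items() if n == top]
    return random.choice(winners)  # ties broken at random

# Toy training set: two clusters labelled "a" and "b"
train = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((6, 5), "b")]
print(knn_predict(train, (0.5, 0.5), k=3))  # "a": two of the three nearest
```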
Figure 1: Stages of Watermark Encoding and Decoding

A. Data Partitioning

Bucketization can be viewed as a special case of the slicing used here for data partitioning, in which there are exactly two columns: one column contains only the sensitive attribute (SA), and the other contains all the quasi-identifier (QI) attributes. The advantages of slicing over bucketization can be understood as follows. First, by partitioning attributes into more than two columns, slicing can be used to prevent membership disclosure; our empirical evaluation on a real dataset shows that bucketization does not prevent membership disclosure. Second, unlike bucketization, which requires a clear separation of the QI attributes and the sensitive attribute, slicing can be used without such a separation. For a dataset such as census data, one often cannot clearly separate QIs from SAs because there is no single external public database that one can use to determine which attributes the adversary already knows; slicing can be useful for such data. Finally, by allowing a column to contain both some QI attributes and the sensitive attribute, the correlations between the sensitive attribute and the QI attributes are preserved.

In this step, the dataset is partitioned according to three factors: (1) the number of buckets M desired by the user; (2) for M buckets, the proportion p of buckets allocated to the head bucket and the proportion 1-p allocated to the tail bucket; and (3) the proportion of query mass c represented by the head bucket. Initially, the distribution is split into a head and a tail, where the head contains data values representing a proportion of the total query mass – normally between 70 and 90 percent – as defined by c. The tail contains mass of proportion 1-c. The precise value of c is specified by the user prior to running the algorithm.
For the remaining M-1 buckets, QSB will partition the head bucket into at most ⌈M*p⌉ buckets, then the tail into at least ⌊M*(1-p)⌋ buckets.

B. Selection of Dataset for Watermarking

The following two steps are applied to the dataset to select tuples for watermarking: threshold computation and hash value computation.

1) Threshold Computation: An adaptive threshold technique is used in this step. Adaptive threshold values for selecting tuples in the database are computed using the centroid:

Adaptive Threshold = C * Mean + Standard Deviation

where C is a constant, Mean is the mean of the tuple values, and Standard Deviation is the standard deviation of the tuple values.

2) Hash Value Computation: In this step, the cryptographic hash function MD5 is applied to the selected dataset to select only those tuples which have an even hash value. This step achieves two objectives: (i) it further enhances the watermark security by hiding the identity of the watermarked tuples from an intruder; and (ii) it further reduces the number of to-be-watermarked tuples, limiting distortions in the dataset. Tuples with even hash values are selected and placed in a subset of the dataset. This subset, consisting of ζ tuples, is a subpart of the dataset D and is not physically separated from the rest of D. As the selection of the ζ tuples is also based on the value of the data selection threshold, Mallory may try to corrupt the embedded watermark by changing the data values such that the data selection threshold value is disturbed, and hence Alice is unable to detect the watermarked tuples in the watermark detection phase. But since Mallory has no knowledge of the confidence factor c, he may only be able to arbitrarily attack some selected part of the watermarked data to corrupt the watermark with some probability P. This probability is made smaller by using the data selection threshold T and even hash values.
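The two selection steps above can be sketched as follows. This is a hedged sketch under illustrative assumptions: the attribute values, the constant C, the key string, and the rule of keeping values above the threshold are all made up for the example, since the paper does not fix them.

```python
import hashlib
import statistics

def adaptive_threshold(values, c):
    """Adaptive Threshold = C * Mean + Standard Deviation."""
    return c * statistics.mean(values) + statistics.stdev(values)

def has_even_hash(secret_key, primary_key):
    """MD5 over the owner's secret key and the tuple's primary key;
    keep the tuple only when the 128-bit digest is an even number."""
    digest = hashlib.md5(f"{secret_key}|{primary_key}".encode()).hexdigest()
    return int(digest, 16) % 2 == 0

# Hypothetical numeric attribute keyed by primary key
attribute = {1: 42.0, 2: 57.5, 3: 61.2, 4: 38.9, 5: 70.1, 6: 55.0}
T = adaptive_threshold(list(attribute.values()), c=0.5)
selected = [pk for pk, v in attribute.items()
            if v > T and has_even_hash("alice-private-key-K", pk)]
print(T, selected)
```

Because the hash also depends on the secret key, an attacker who sees the data cannot tell which tuples passed the even-hash filter.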
C. Generation of Watermark Bits

Reed–Solomon (RS) codes are non-binary cyclic error-correcting codes which are used here for the generation of watermark bits. In Reed–Solomon coding, source symbols are viewed as coefficients of a polynomial p(x) over a finite field. The original idea was to create n code symbols from k source symbols by oversampling p(x) at n > k distinct points, transmit the sampled points, and use interpolation techniques at the receiver to recover the original message. Reed–Solomon (R–S) codes in terms of the parameters n, k, t and any positive integer m > 2 can be expressed as

(n, k) = (2^m - 1, 2^m - 1 - 2t)

where n - k = 2t is the number of parity symbols and t is the symbol-error correcting capability of the code. The generator polynomial for an R–S code takes the following form:

g(X) = g0 + g1 X + g2 X^2 + ... + g(2t-1) X^(2t-1) + X^(2t)

The degree of the generator polynomial is equal to the number of parity symbols. R–S codes are a subset of the Bose–Chaudhuri–Hocquenghem (BCH) codes; hence, it should be no surprise that this relationship between the degree of the generator polynomial and the number of parity symbols holds, just as for BCH codes. Since the generator polynomial is of degree 2t, there must be precisely 2t successive powers of α that are roots of the polynomial. We designate the roots of g(X) as α, α^2, ..., α^(2t). It is not necessary to start with the root α; starting with any power of α is possible [13].

D. Encoding

Tuples are encoded and the watermark bits appended using MD5 and the adaptive threshold technique.

MD5: The MD5 message-digest algorithm is a widely used cryptographic hash function producing a 128-bit (16-byte) hash value, typically expressed in text format as a 32-digit hexadecimal number.

Append padding bits: The input message is "padded" (extended) so that its length (in bits) is congruent to 448 modulo 512. Padding is always performed, even if the length of the message is already 448 mod 512.
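As a quick numeric check of this padding rule (a small sketch: the padding always adds at least 1 and at most 512 bits, so that the padded length is 448 mod 512 and leaves room for the 64-bit length field that follows):

```python
def md5_padding_bits(message_bits):
    """Number of padding bits MD5 appends to a message of the given bit
    length: the smallest positive count that makes the padded length
    congruent to 448 modulo 512."""
    pad = (448 - message_bits) % 512
    return pad if pad != 0 else 512  # padding is always performed

print(md5_padding_bits(0))    # 448
print(md5_padding_bits(448))  # 512: padded even when already 448 mod 512
print(md5_padding_bits(447))  # 1
```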
Padding is performed as follows: a single "1" bit is appended to the message, and then "0" bits are appended so that the length in bits of the padded message becomes congruent to 448 mod 512. At least one bit and at most 512 bits are appended.

Append length: A 64-bit representation of the length of the message is appended to the result of step 1. If the length of the message is greater than 2^64 bits, only the low-order 64 bits are used. The resulting message (after padding with the bits and with the length) has a length that is an exact multiple of 512 bits; equivalently, its length is an exact multiple of 16 (32-bit) words.

Initialize MD buffer: A four-word buffer (A, B, C, D) is used to compute the message digest. Each of A, B, C, D is a 32-bit register. These registers are initialized to the following values (in hexadecimal, low-order bytes first):

word A: 01 23 45 67
word B: 89 ab cd ef
word C: fe dc ba 98
word D: 76 54 32 10

Process message in 16-word blocks: Four functions are defined such that each takes three 32-bit words as input and produces one 32-bit word as output:

F(X, Y, Z) = (X AND Y) OR (NOT(X) AND Z)
G(X, Y, Z) = (X AND Z) OR (Y AND NOT(Z))
H(X, Y, Z) = X XOR Y XOR Z
I(X, Y, Z) = Y XOR (X OR NOT(Z))

Figure 2: The structure of the MD5 algorithm (Rivest, 1992)

E. Watermark Decoding

The watermarked bits are not given to the decoding algorithm; the remaining bits are subjected to MD5 decoding. Decoding an MD5 hash without knowing the original value of the encoded string is not totally accurate; however, some values may be recovered with a higher degree of certainty because they form recognizable elements, such as existing words, using a dictionary approach to look up the hash and compare it with known existing values. Some MD5 reverse look-up databases contain millions of hashes and their corresponding decoded values. This is generally considered the easiest method, as it can be executed in mere fractions of a second.
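A toy version of such a reverse look-up can be sketched as follows (hedged: real look-up databases index millions of precomputed hashes; the candidate word list here is invented for illustration):

```python
import hashlib

# Precompute MD5 digests for a small candidate dictionary.
candidates = ["password", "letmein", "qwerty", "watermark"]
lookup = {hashlib.md5(w.encode()).hexdigest(): w for w in candidates}

def reverse_md5(digest_hex):
    """Return the preimage if the digest was precomputed, else None."""
    return lookup.get(digest_hex)

h = hashlib.md5(b"watermark").hexdigest()
print(reverse_md5(h))         # watermark
print(reverse_md5("0" * 32))  # None: digest not in the table
```

The look-up itself is a constant-time dictionary access, which is why this method executes in fractions of a second once the table is built.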
A second method uses a more brute-force approach, employing precomputed tables – commonly known as "rainbow tables" – to analyze the encoded MD5 elements. Neither approach has 100% certainty of successful decoding; however, the possibility that it might has caused MD5 to be identified as technically insecure by National Security Agency (NSA) standards.

IV. EXPERIMENTAL RESULT

In this section, the experimental results of the proposed work are presented. A subset of 50,000 tuples is selected from a real-life dataset. The work was performed on a server with a Pentium(R) Dual-Core 2.10 GHz CPU and 4 GB of RAM, on the Java platform. The decoding accuracy calculated for the given task is shown in figure 3 below. The decoding of the watermarked dataset uses the majority voting scheme to eliminate decoding errors (if any) resulting from a malicious attack (or attacks).

Figure 3: Decoding Result.

From figure 3, the results of the decoding show maximum accuracy. Thus the scheme withstands most attacks and errors and correctly decodes most of the given input dataset.

V. CONCLUSION

Watermarked data suffers from many kinds of attacks. In this work, the security mechanism helps to resolve ownership conflicts over the watermarked dataset in the case of additive attacks, and the method provides low distortion and maximum decoding accuracy. The robustness of the watermarking scheme is proved by analyzing its decoding accuracy under different types of malicious attacks using a real-world dataset. It also provides a solution to resolve conflicting ownership issues in the case of the additive attack, in which Mallory inserts his own watermark into Alice's watermarked database.

REFERENCES

[1] R. Agrawal, P. Haas, and J. Kiernan, "Watermarking relational data: framework, algorithms and analysis," The VLDB Journal, vol. 12, no. 2, pp. 157–169, 2003.
[2] M. Shehab, E. Bertino, and A. Ghafoor, "Watermarking relational databases using optimization-based techniques," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 1, pp. 116–129, 2008.
[3] R. Sion, M. Atallah, and S. Prabhakar, "Rights protection for relational data," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 6, pp. 1509–1525, 2004.
[4] B. Kaipa and S. A. Robila, "Statistical steganalysis of images using open source software," in Applications and Technology Conference (LISAT), 2010 Long Island Systems, pp. 1–5, Farmingdale, NY, May 2010.
[5] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
[6] P. A. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach. Prentice-Hall International, London, 1982.
[7] R. Agrawal and J. Kiernan, "Watermarking relational databases," in Proceedings of the 28th International Conference on Very Large Databases, Hong Kong, China, 2002.
[8] Y. Li, H. Guo, and S. Jajodia, "Tamper detection and localization for categorical data using fragile watermarks," in DRM '04: Proceedings of the 4th ACM Workshop on Digital Rights Management, pp. 73–82, ACM Press, 2004.
[9] M. Shehab, E. Bertino, and A. Ghafoor, "Watermarking relational databases using optimization-based techniques," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 1, pp. 116–129, 2008.
[10] B. Schneier, Applied Cryptography. John Wiley, 1996.
[11] R. Halder, S. Pal, and A. Cortesi, "Watermarking techniques for relational databases: survey, classification and comparison," Journal of Universal Computer Science, vol. 16, no. 21, pp. 3164–3190, 2010.
[12] M. Kamran, S. Suhail, and M. Farooq, "A robust, distortion minimizing technique for watermarking relational databases using once-for-all usability constraints," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 12, Dec. 2013.
[13] B. Sklar, "Reed–Solomon Codes."