An Efficient Distortion Minimizing Technique for
Watermarking Relational Databases
G. Shyamala #1, C. Kanimozhi *2, S.P. Kavya #3
# Assistant Professor, Computer Science & Engg.,
Sri Shakthi Institute of Engg. & Tech.,
Coimbatore.
1 shyamala@siet.ac.in
2 kanimozhi@siet.ac.in
3 spkavya@siet.ac.in
Abstract— Ownership protection of data is an important issue at present. Intelligent mining techniques are applied to data extracted from relational databases to detect interesting patterns (generally hidden in the data) that provide significant support to decision makers in making effective, accurate and relevant decisions; as a result, sharing of data between its owners and data mining experts (or corporations) is increasing significantly. This work focuses mainly on decoding accuracy, from which the watermarking efficiency is calculated. The MD5 technique is used in both encoding and decoding, which enables the decoder to achieve maximum accuracy. The experimental results demonstrate its efficiency in terms of maximum decoding accuracy.
Keywords— Data usability constraints, database watermarking,
data quality, right protection, ownership protection, Reed–
Solomon.
I. INTRODUCTION
Database technology has been used with great
success in traditional business data processing.
There is an increasing desire to use this technology
in new application domains. One such application
domain that is likely to acquire considerable
significance in the near future is database mining.
An increasing number of organizations are creating
ultra large databases (measured in gigabytes and
even terabytes) of business data, such as consumer
data, transaction histories, sales records, etc.; such
data forms a potential gold mine of valuable
business information.
In general, the database watermarking techniques
consist of two phases: Watermark Embedding and
Watermark Verification. During watermark
embedding phase, a private key K (known only to
the owner) is used to embed the watermark W into
the original database. The watermarked database is
then made publicly available. To verify the
ownership of a suspicious database, the verification
process is performed where the suspicious database
is taken as input and by using the private key K (the
same which is used during the embedding phase)
the embedded watermark (if present) is extracted
and compared with the original watermark
information. Digital Watermarks for relational
databases are potentially useful in many
applications, including: Ownership Assertion,
Fingerprinting, Fraud and Tamper Detection etc
[11].
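The embedding/verification workflow described above can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: the helper names, the use of HMAC-MD5 as the keyed hash, and LSB marking of a numeric attribute are all assumptions made for the example.

```python
import hashlib
import hmac

def keyed_hash(K: bytes, pk: str) -> int:
    # Keyed hash of a tuple's primary key; only the owner holding K can reproduce it.
    return int.from_bytes(hmac.new(K, pk.encode(), hashlib.md5).digest(), "big")

def embed(rows, K, watermark_bits):
    # Mark each tuple by forcing the LSB of its numeric attribute to a watermark bit
    # chosen by the keyed hash of the tuple's primary key.
    marked = []
    for pk, value in rows:
        bit = watermark_bits[keyed_hash(K, pk) % len(watermark_bits)]
        marked.append((pk, (value & ~1) | bit))
    return marked

def verify(rows, K, watermark_bits):
    # Recompute the same keyed hash and check each tuple's LSB against the watermark;
    # returns the fraction of tuples carrying the expected bit.
    matches = sum(
        1 for pk, value in rows
        if (value & 1) == watermark_bits[keyed_hash(K, pk) % len(watermark_bits)]
    )
    return matches / len(rows)

rows = [("r1", 100), ("r2", 57), ("r3", 240), ("r4", 13)]
wm = [1, 0, 1, 1, 0, 1]
marked = embed(rows, b"secret-key-K", wm)
print(verify(marked, b"secret-key-K", wm))  # → 1.0 on the untampered database
```

Only the private key K is needed at verification time, which keeps the decoding "blind" in the sense used later in this paper.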
An intended recipient (Bob) wants the data
owner (Alice) to define tight usability constraints
so that he gets accurate data. For maximum
robustness of watermark, Alice, on the other hand,
wants to have larger bandwidth on manipulations
performed during embedding of a watermark which
is only possible if she puts soft usability constraints
[1], [2]. To conclude, Bob and Alice have
conflicting requirements: Bob wants “minimum
distortions in the watermarked data” while Alice
wants to produce “watermarked data having strong
ownership”. Any watermark embedding technique
that strives for a compromise bandwidth allows an
attacker (Mallory) to corrupt or remove the
watermark by slightly surpassing the available
bandwidth. The compromise bandwidth is achieved
once Alice defines the usability constraint in such a
way that the embedded watermark is not only
robust but also causes minimum distortions to the
underlying data. The job of analyzing the semantics
of each application and use it to define usability
constraints is not only cumbersome but also
inefficient for a data owner. Remember, the
robustness of a watermark is measured by the
watermark decoding accuracy that in turn depends
on the bandwidth available for manipulation [12].
In the existing system a statistical-based algorithm was proposed in which a database is partitioned into a maximum number of unique, non-intersecting
subsets of tuples. The data partitioning concept is
based on the use of special marker tuples, making it
vulnerable to watermark synchronization errors
particularly in the case of tuple insertion and
deletion attacks, as the position of marker tuples is
disturbed by these attacks. Such errors may be
reduced if marker tuples are stored during
watermark embedding phase and the same may be
used for constructing the data partitions again
during watermark decoding phase. But using the
stored marker tuples to reconstruct the partitions
violates the requirement of “blind decoding” of
watermark. Furthermore, the threshold technique
for bit decoding involves arbitrarily chosen
thresholds – without following any optimality
criteria – that are responsible for the error in the
decoding process. The concept of usability bounds
on data is used in this technique to control
distortions introduced in the data during watermark
embedding. However, an attacker can corrupt the
watermark by launching large scale attacks on large
number of rows. Moreover, the decoding accuracy
is dependent on the usability bounds set by the data
owner; as a result, the decoding accuracy is
deteriorated if an attacker violates these bounds. An
important shortcoming of this approach is that the
data owner needs to specify usability constraints
separately for every type of application that will use
data [3]. This approach also has several disadvantages: high distortion, failure to provide maximum decoding accuracy, and insufficient resolution of ownership conflicts over the watermarked dataset in the case of additive attacks.
This paper proposes a novel watermark decoding
algorithm independent of the usability constraints
(or available bandwidth). As a result, our approach
facilitates Alice to define usability constraints only
once for a particular database for every possible
type of intended application. Moreover, it also
ensures that the watermark introduces the least
possible distortions to the original data without
compromising the robustness of the inserted
watermark. The proposed algorithm embeds every bit of a multi-bit watermark (generated from date-time) in each selected row (in a numeric attribute)
with the objective of having maximum robustness
even if an attacker is somehow able to successfully
corrupt the watermark in some selected part of the
dataset. The proposed system also proves the
robustness of our watermarking scheme by
analyzing its decoding accuracy under different
types of malicious attacks using a real world dataset.
It also provides solutions to resolve conflicting
ownership issues in case of the additive attack in
which Mallory inserts his own watermark in Alice’s
watermarked database.
II. LITERATURE REVIEW
Decision Tree classification (DT): A Decision
Tree, more properly a classification tree, is mostly
used to learn a classification model which predicts
the value of a dependent attribute (variable) given
the values of the independent (input) attributes
(variables). This solves a problem known as
supervised classification since the dependent
attribute and the number of classes (values) that it
may have are given [4]. In decision tree structures,
leaves signify classifications and branches signify a
combination of characteristics that direct to those
classifications.
Support Vector Machine (SVM) Classification:
The aim of SVMs is to learn a model which
forecasts class tag of cases in the testing set. This
classification algorithm is one of the most robust
classifiers for two-class classification. SVM can
manage both linear and nonlinear classification
problems. For linear discrete problems, SVM
classifiers purely explore for a hyper-plane that
distinguishes negative and positive instances [5].
K-Nearest Neighbour classification (KNN): One
kind of supervised classification techniques is
called K-nearest neighbour classifier. KNN
presented by Devijver and Kittler (Devijver &
Kittler, 1982) [6], commonly use the Euclidean
distance measure. For each row of the test set, the k
nearest (in Euclidean distance) training set vectors
are found, and the classification is decided by
majority vote, with ties broken at random. If there
are ties for the kth nearest vector, all candidates are
included in the vote.
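The KNN procedure just described can be sketched as follows. This is a minimal illustration; ties here are resolved by Counter ordering rather than at random as in Devijver and Kittler, and the training data is invented for the example.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    # train: list of (feature_vector, label) pairs. The query is classified by
    # majority vote over the k nearest training vectors under Euclidean distance.
    dists = sorted((math.dist(x, query), label) for x, label in train)
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((5.0, 5.0), "B"), ((5.1, 4.9), "B")]
print(knn_predict(train, (1.1, 1.0), k=3))  # → A
```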
Digital watermarking is the technique used to hide a small amount of digital data in a digital signal in such a way that it cannot be detected by a standard playback device or
viewer. In Digital Watermarking, an indelible and
invisible message is embedded into both the image
and the audio track of the motion picture as it
passes through the server. According to R. Agarwal,
piracy of digital assets such as image, video, audio
and text can be protected by inserting a digital
watermark into the data thus providing a promising
way to protect digital data from illicit copying and
manipulation [5]. After embedding the watermark,
the data and watermark are in-separable [7].
The security of relational databases has been a
great concern since the expanded use of these data
over the Internet. Because data allow unlimited
number of copies of an “original” without any
quality loss and can also be easily distributed and
forged [7]. Hence, Digital watermarking for
relational databases emerged as a candidate solution
to provide copyright protection, tamper detection,
traitor tracing and maintaining integrity of
relational data [8].
III. PROPOSED WORK
The proposed work is divided into preprocessing,
Watermark Bits Generation, Data Partitioning,
Selection of dataset for watermarking, Watermark
Embedding, Watermark decoding and finally
Majority Voting.
In preprocessing, the dataset, which contains numerical and image data, is supplied for preprocessing. Preprocessing is a vital step which removes unwanted data from the dataset and noise present in the images.
[Figure 1 pipeline: Create DB → Generate Watermarking Bits (DW) → Calculate Adaptive Threshold → MD5 Embedded Bits → Attacker Channel → Decode DB (Decoding Using MD5, DW1) → Decoding Accuracy Calculation]
Figure 1: Stages of Watermark Encoding and Decoding
A. Data Partitioning
Bucketization can be viewed as a special case of slicing, used here for data partitioning, where there are exactly two columns: one column contains only the sensitive attribute (SA), and the other contains all the quasi-identifier (QI) attributes.
advantages of slicing over bucketization can be
understood as follows: First, by partitioning
attributes into more than two columns, slicing can
be used to prevent membership leak. Our empirical
evaluation on a real data set shows that
bucketization does not prevent membership
disclosure. Second, unlike bucketization, which
requires a clear separation of QI attributes and the
sensitive attribute, slicing can be used without such
a separation. For data set such as the census data,
one often cannot clearly separate QIs from SAs
because there is no single external public database
that one can use to determine which attributes the
adversary already knows. Slicing can be useful for
such data. Finally, by allowing a column to contain
both some QI attributes and the sensitive attribute,
attribute correlations between the sensitive attribute
and the QI attributes are preserved.
In this step, the data set is partitioned according
to three factors: (1) the number of buckets M
desired by the user, (2) for M buckets, the
proportion p of buckets allocated to the head bucket
and the proportion 1 − p allocated to the tail bucket,
and (3) the proportion of query mass c represented
by the head bucket. Initially, the distribution is split
into a head and tail, where the head contains data
values representing a proportion of the total query
mass – normally between 70 and 90 percent – as
defined by c. The tail contains mass of the
proportion 1-c. The precise value of c is specified
by the user prior to running the algorithm. For the
remaining M-1 buckets, QSB will partition the head
bucket into at most ⌈M*p⌉ buckets, then the tail into
at least ⌊M*(1-p)⌋ buckets.
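The head/tail split and bucket allocation above can be sketched numerically. This is a simplified reading of the rules: the function names and example masses are invented for illustration, and the actual QSB algorithm may differ in detail.

```python
import math

def split_head_tail(masses, c=0.8):
    # masses: per-value query mass, sorted in descending order. The head holds
    # values covering proportion c of the total query mass; the rest is the tail.
    total = sum(masses)
    head, acc = [], 0.0
    for mass in masses:
        if acc >= c * total:
            break
        head.append(mass)
        acc += mass
    return head, masses[len(head):]

def bucket_counts(M, p):
    # Head gets at most ceil(M*p) buckets, tail at least floor(M*(1-p)).
    return math.ceil(M * p), math.floor(M * (1 - p))

masses = [40, 25, 15, 10, 5, 3, 2]        # descending query mass, totals 100
head, tail = split_head_tail(masses, c=0.8)
print(head, tail)                          # → [40, 25, 15] [10, 5, 3, 2]
print(bucket_counts(10, 0.5))              # → (5, 5)
```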
B. Selection of dataset for watermarking
The following two steps are applied on the
dataset to select tuples for watermarking.
 Threshold Computation and
 Hash Value Computation.
1) Threshold Computation:
An adaptive threshold technique is used in this step. Adaptive threshold values for selecting tuples from the database are computed using the centroid:

Adaptive Threshold = C × Mean + Standard Deviation

where C is a constant (the confidence factor), Mean is the mean of the tuples, and Standard Deviation is the standard deviation of the tuples.
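A minimal sketch of this computation, assuming C is a scalar constant and using the sample standard deviation (the paper does not specify which variant):

```python
import statistics

def adaptive_threshold(values, C=1.0):
    # T = C * mean + standard deviation, computed over the candidate tuple values.
    return C * statistics.mean(values) + statistics.stdev(values)

values = [10, 12, 11, 13, 14]
T = adaptive_threshold(values, C=1.0)
selected = [v for v in values if v > T]  # tuples above the threshold are candidates
print(round(T, 2), selected)             # → 13.58 [14]
```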
2) Hash Value Computation:
In this step, a cryptographic hash function MD5
is applied on the selected dataset to select only
those tuples which have an even hash value. This
step achieves two objectives: (i) it further enhances
the watermark security by hiding the identity of the
watermarked tuples from an intruder; and (ii) it
further reduces the number of to-be-watermarked
tuples to limit distortions in the dataset.
The tuples with even hash values are selected from the dataset and placed in a separate logical dataset. This dataset, consisting of ζ tuples, is a subpart of the dataset D and is not physically separated from the rest of the parts of D.
As selection of ζ tuples is also based on the value
of data selection threshold, Mallory may try to
corrupt the embedded watermark by changing the
data values such that the data selection threshold
value is disturbed and hence Alice is unable to
detect the watermarked tuples in the watermark
detection phase. But since Mallory has no knowledge of the confidence factor c, he may be able only to arbitrarily attack some selected part of the watermarked data to corrupt the watermark with some probability P. This probability is made smaller by using the data selection threshold T and even hash values.
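The even-hash selection step can be sketched as below. The keying of MD5 with a secret string concatenated to the primary key, and the tuple layout, are assumptions for illustration:

```python
import hashlib

def select_tuples(dataset, secret_key: str):
    # Keep only tuples whose keyed MD5 hash is even; without the secret key an
    # attacker cannot tell which tuples were chosen for watermarking.
    selected = []
    for pk, value in dataset:
        digest = hashlib.md5((secret_key + str(pk)).encode()).hexdigest()
        if int(digest, 16) % 2 == 0:
            selected.append((pk, value))
    return selected

dataset = [(i, i * 10) for i in range(8)]
subset = select_tuples(dataset, "owner-secret")
print(len(subset), "of", len(dataset), "tuples selected")  # roughly half on average
```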
C. Generation of Watermark bits
Reed–Solomon (RS) codes are non-binary cyclic error-correcting codes which are used here for generating the watermark bits. In Reed–Solomon
coding, source symbols are viewed as coefficients
of a polynomial p(x) over a finite field. The original
idea was to create n code symbols from k source
symbols by oversampling p(x) at n > k distinct
points, transmit the sampled points, and use
interpolation techniques at the receiver to recover
the original message.
Reed–Solomon (R–S) codes in terms of the parameters n, k, t, and any positive integer m > 2 can be expressed as

(n, k) = (2^m − 1, 2^m − 1 − 2t)

where n − k = 2t is the number of parity symbols, and t is the symbol-error correcting capability of the code. The generator polynomial for an R–S code takes the following form:

g(X) = (X − α)(X − α^2) ⋯ (X − α^(2t))

The degree of the generator polynomial is equal to the number of parity symbols. R–S codes are a subset of the Bose, Chaudhuri, and Hocquenghem (BCH) codes; hence, it should be no surprise that this relationship between the degree of the generator polynomial and the number of parity symbols holds, just as for BCH codes. Since the generator polynomial is of degree 2t, there must be precisely 2t successive powers of α that are roots of the polynomial. We designate the roots of g(X) as α, α^2, …, α^(2t). It is not necessary to start with the root α; starting with any power of α is possible [13].
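The parameter relationships can be checked numerically. This small sketch only computes (n, k) from m and t; it is not a full RS encoder:

```python
def rs_parameters(m: int, t: int):
    # For an RS code over GF(2^m): n = 2^m - 1 total symbols,
    # k = n - 2t message symbols, correcting up to t symbol errors.
    n = 2 ** m - 1
    k = n - 2 * t
    return n, k

# The widely used RS(255, 223) code corresponds to m = 8, t = 16.
print(rs_parameters(8, 16))  # → (255, 223)
```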
D. Encoding
Tuples are encoded and the watermark bits are appended using MD5 and the adaptive threshold technique.
MD5
The MD5 message-digest algorithm is a widely used cryptographic hash function producing a 128-bit (16-byte) hash value, typically expressed in text format as a 32-digit hexadecimal number.
 Append padding bits
The input message is "padded" (extended) so that its length (in bits) is congruent to 448 modulo 512. Padding is always performed, even if the length of the message is already congruent to 448 modulo 512.
Padding is performed as follows: a single "1" bit
is appended to the message, and then "0" bits are
appended so that the length in bits of the padded
message becomes congruent to 448 mod 512. At
least one bit and at most 512 bits are appended.
 Append length
A 64-bit representation of the length of the message is appended to the result of step 1. If the length of the message is greater than 2^64 bits, only the low-order 64 bits are used. The resulting message (after padding and appending the 64-bit length b) has a length that is an exact multiple of 512 bits. Equivalently, the message will have a length that is an exact multiple of 16 (32-bit) words.
 Initialize MD buffer
A four-word buffer (A, B, C, D) is used to
compute the message digest. Each of A, B, C, D is
a 32-bit register. These registers are initialized to the following values (in hexadecimal, low-order bytes first):
word A: 01 23 45 67
word B: 89 ab cd ef
word C: fe dc ba 98
word D: 76 54 32 10
 Process message in 16-word blocks
Four auxiliary functions are defined such that each function takes an input of three 32-bit words and produces a 32-bit word output:
F(X, Y, Z) = (X AND Y) OR (NOT(X) AND Z)
G(X, Y, Z) = (X AND Z) OR (Y AND NOT(Z))
H(X, Y, Z) = X XOR Y XOR Z
I(X, Y, Z) = Y XOR (X OR NOT(Z))
Figure 2: The structure of MD5 algorithm (Rivest, 1992)
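In practice the full algorithm is available as hashlib.md5; the sketch below shows the 32-hex-digit (128-bit) output on a standard test string and verifies the padding arithmetic described above:

```python
import hashlib

msg = b"The quick brown fox jumps over the lazy dog"
digest = hashlib.md5(msg).hexdigest()
print(digest)       # → 9e107d9d372bb6826bd81d3542a419d6
print(len(digest))  # → 32 hexadecimal digits = 128 bits

# Padding check: the single "1" bit, the "0" bits, and the 64-bit length field
# always extend the message to an exact multiple of 512 bits.
bit_len = len(msg) * 8
pad_bits = (448 - (bit_len + 1)) % 512 + 1
assert (bit_len + pad_bits + 64) % 512 == 0
```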
E. Watermark Decoding
The watermarked bits are not given to the decoding algorithm; the remaining bits are subjected to MD5 decoding.
Decoding an MD5 hash without knowing the original value of the encoded string is not totally accurate; however, some values may be recovered with a higher degree of certainty because they form recognizable elements, such as existing words. A dictionary approach looks up the hash and compares it with known existing values; some MD5 reverse look-up databases contain millions of hashes and their corresponding decoded values. This is generally considered the easiest method, as it can be executed in mere fractions of a second. A second method uses a more brute-force approach, relying on precomputed tables (commonly known as "rainbow tables") to analyze the encoded MD5 elements. Neither approach has 100% certainty of successful decoding; however, the possibility that it might succeed has caused MD5 to be identified as technically insecure by National Security Agency (NSA) standards.
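A toy version of the dictionary look-up approach described above (the candidate word list is invented for illustration; real services index millions of entries):

```python
import hashlib

# Precompute MD5 digests of candidate strings, then "invert" a target hash
# by a simple dictionary lookup.
candidates = ["password", "letmein", "qwerty", "dragon"]
lookup = {hashlib.md5(w.encode()).hexdigest(): w for w in candidates}

target = hashlib.md5(b"letmein").hexdigest()
print(lookup.get(target))  # → letmein
```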
IV. EXPERIMENTAL RESULT
In this section, the experimental results for the proposed work are presented. For this purpose, a subset of 50,000 tuples is selected from a real-life dataset. The experiments were performed on a server with a Pentium(R) Dual-Core 2.10 GHz CPU and 4 GB of RAM, on the JAVA platform.
The decoding accuracy calculated for the given task is shown in figure 3 below. The decoding results of the watermarked dataset use a majority voting scheme to eliminate decoding errors (if any) caused by a malicious attack (or attacks).
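The majority voting step can be sketched as follows (a simplified illustration; the bit-string representation and the function name are assumptions for the example):

```python
from collections import Counter

def majority_vote(decoded_copies):
    # decoded_copies: one decoded bit string per watermarked region; each bit
    # position is resolved by majority vote to absorb attack-induced errors.
    result = []
    for i in range(len(decoded_copies[0])):
        bits = [copy[i] for copy in decoded_copies]
        result.append(Counter(bits).most_common(1)[0][0])
    return "".join(result)

# Three noisy decodings of the watermark 1011; the two flipped bits are voted out.
print(majority_vote(["1011", "1001", "1111"]))  # → 1011
```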
Figure 3: Decoding Result.
From figure 3, the results of the decoding show maximum accuracy. Thus the scheme eliminates most attacks and errors and correctly decodes the maximum of the given input dataset.

V. CONCLUSION
Watermarked data suffers from various attacks in many methods. In this work, a security mechanism helps to resolve ownership conflicts over the watermarked dataset in the case of additive attacks, and the method provides less distortion and maximum accuracy for decoding. It proves the robustness of the watermarking scheme by analyzing its decoding accuracy under different types of malicious attacks using a real-world dataset. It also provides solutions to resolve conflicting ownership issues in the case of the additive attack, in which Mallory inserts his own watermark in Alice's watermarked database.

REFERENCES
[1] R. Agrawal, P. Haas, and J. Kiernan, "Watermarking relational data: framework, algorithms and analysis," The VLDB Journal, vol. 12, no. 2, pp. 157–169, 2003.
[2] M. Shehab, E. Bertino, and A. Ghafoor, "Watermarking relational databases using optimization-based techniques," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 1, pp. 116–129, 2008.
[3] R. Sion, M. Atallah, and S. Prabhakar, "Rights protection for relational data," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 6, pp. 1509–1525, 2004.
[4] B. Kaipa and S.A. Robila, "Statistical Steganalysis of Images Using Open Source Software," 2010 Long Island Systems, Applications and Technology Conference (LISAT), pp. 1–5, ISBN: 978-1-4244-5550-8, May 2010, Farmingdale, NY.
[5] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, 1995, ISSN: 0885-6125.
[6] P.A. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach. Prentice-Hall International, London, 1982.
[7] R. Agrawal and J. Kiernan, "Watermarking relational databases," in Proceedings of the 28th International Conference on Very Large Databases, Hong Kong, China, 2002.
[8] Y. Li, H. Guo, and S. Jajodia, "Tamper Detection and Localization for Categorical Data Using Fragile Watermarks," in DRM '04: Proceedings of the 4th ACM Workshop on Digital Rights Management, pp. 73–82. ACM Press, 2004.
[9] M. Shehab, E. Bertino, and A. Ghafoor, "Watermarking relational databases using optimization-based techniques," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 1, pp. 116–129, 2008.
[10] B. Schneier, Applied Cryptography. John Wiley, 1996.
[11] R. Halder, S. Pal, and A. Cortesi, "Watermarking Techniques for Relational Databases: Survey, Classification and Comparison," Journal of Universal Computer Science, vol. 16, no. 21, pp. 3164–3190, 2010.
[12] M. Kamran, S. Suhail, and M. Farooq, "A Robust, Distortion Minimizing Technique for Watermarking Relational Databases Using Once-for-All Usability Constraints," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 12, Dec. 2013.
[13] B. Sklar, "Reed–Solomon Codes".