A New Text Watermarking Method

Interword Distance Changes Represented by Sine Waves for
Watermarking Text Images
Ding Huang and Hong Yan
School of Electrical and Information Engineering
University of Sydney, NSW 2006
Email: hding@ee.usyd.edu.au, yan@ee.usyd.edu.au
Abstract
Digital watermarking is widely believed to be a
valid means to discourage illicit distribution of
information content. Digital watermarking
methods for text documents are limited because
of the binary nature of text documents. A distinct
feature of a text document is its space patterning.
In this paper we propose a new approach in text
watermarking where interword spaces of
different text lines are slightly modified. After the
modification, the average spaces of various lines
have the characteristics of a sine wave and the
wave constitutes a mark. Both private and public
watermarking algorithms are discussed in this
paper. Preliminary experiments have shown
promising results. Our experiments suggest
space patterning of text documents can be a
useful tool in digital watermarking.
Index Terms: Copyright protection,
correlation detection, image processing, space
patterning, text watermarking, wave coding.
I. Introduction
With the widespread use of the Internet in our
society, the distribution of and access to
information are greatly facilitated. However,
without methods which prevent or discourage
illicit redistribution and reproduction of
information content, copyright can be easily
infringed. Digital watermarking is widely
believed to be a valid solution to the problem
and currently there is intensive research in this
area, in both academic and industrial
communities.
Compared to the plurality of previously
proposed methods in digital watermarking for
picture and video images, digital watermarking
methods for text documents are very limited.
One reason for this difference is that text is a
binary image and lacks rich grayscale
information.
Digital watermarking for text documents is
discussed in [3], [4], and [6]-[9]. There are
primarily three types of text watermarking
methods which have been developed previously.
They are
(a) Line-Shift Coding – vertically shifts the
locations of text lines to encode the
document.
(b) Word-Shift Coding – horizontally shifts the
locations of words within text lines to
encode the document.
(c) Feature Coding – chooses certain text
features and alters those features.
These three methods require the original
unmarked text for decoding. Furthermore, since
there are already variations in the original
unmarked text, these methods need to establish
some measurement basis to differentiate
watermarks from the original variation.
In this paper, we propose a new text
watermarking method. It makes use of the
distinct character of a text document – space,
more specifically interword spaces of text lines,
to watermark a text document. Our encoding
technique adjusts interword spaces in a text
document so that mean spaces across different
lines show characteristics of a sine wave, and
information can be encoded in the sine wave(s).
Since watermarking is embedded in both
horizontal and vertical directions, the method can
be inherently robust against interference.
Furthermore, information can be recovered either
with or without the original image, and control
lines or control blocks are not needed for
decoding.
We propose to call our new
watermarking method Space Coding.
In Section II, we analyze the characteristics of
space in a text document and show that average
spaces of text lines are random. We define a
random variable for average space in Section III
and propose to use a sine wave to modify
average spaces for encoding information. In
Section IV, we describe watermarking detection
methods and present some test results. Finally,
we present concluding remarks in Section V and
suggest future work.
II. Space Profile and Statistics
A text page in digital form can be represented
by the following function
$$f(x, y) \in \{0, 1\}, \quad x = 0, 1, \ldots, W, \quad y = 0, 1, \ldots, L$$
that represents black and white pixels. Here, W
and L are the width and length of the page in
pixels respectively.
In digital image processing, interword space is
detected with the vertical projection profile
$$v(x) = \sum_{y=t}^{b} f(x, y)$$
which is a summation of the “on” pixels along a
vertical column from t (top) to b (bottom) of a
text line. If there is no “on” pixel for a
consecutive run of x
$$v(x) = 0, \quad x = k, k+1, \ldots, k+c$$
then interword space is detected. Figure 1 shows
a typical vertical profile of five words.
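The projection-based detection above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the list-of-rows image layout (1 for an "on" pixel) and the `min_gap` threshold that separates interword gaps from narrower intercharacter gaps are hypothetical and not taken from the paper.

```python
def vertical_profile(line_img):
    """Return v(x): the number of "on" pixels in each column of one text line.

    line_img is a list of rows (top t to bottom b), each a list of 0/1 pixels.
    """
    return [sum(column) for column in zip(*line_img)]


def find_gaps(profile, min_gap):
    """Return (start, width) for each run of empty columns between inked columns.

    Only runs at least min_gap columns wide are reported; min_gap is an
    assumed threshold for telling interword gaps from intercharacter gaps.
    Empty columns in the left and right margins are ignored.
    """
    inked = [x for x, v in enumerate(profile) if v > 0]
    if not inked:
        return []
    gaps = []
    run_start = None
    for x in range(inked[0], inked[-1] + 1):
        if profile[x] == 0:
            if run_start is None:
                run_start = x          # a run of empty columns begins
        elif run_start is not None:
            width = x - run_start      # the run ended at column x - 1
            if width >= min_gap:
                gaps.append((run_start, width))
            run_start = None
    return gaps


# Toy line: three inked blocks, a 1-column gap and a 4-column gap.
row = [0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0]
line = [row, row]
print(find_gaps(vertical_profile(line), min_gap=3))  # [(7, 4)]
```

In practice the threshold would have to be chosen per font and resolution; the paper does not state how the run length c is selected.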
Intuitively, average space of a text line can be
used as a parameter to study the characteristics
of space patterning in a text document. For a line
with d words, we have an average space
$$S_a = \frac{S_t}{d - 1}, \quad d > 1 \tag{1}$$
where $S_t$ is the total interword space in a text
line calculated in pixels.
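As a small worked example of (1), the following Python sketch computes the average space from a list of interword gap widths. The function name and the input format are assumptions made for illustration.

```python
def average_space(gap_widths):
    """Average interword space S_a = S_t / (d - 1) for one text line.

    gap_widths holds the d - 1 interword gaps of a line with d words, in pixels.
    """
    if not gap_widths:
        raise ValueError("a line needs at least two words to have interword space")
    s_t = sum(gap_widths)            # total interword space S_t
    return s_t / len(gap_widths)     # len(gap_widths) equals d - 1


# A line with 4 words and gaps of 12, 15 and 9 pixels: S_t = 36, d - 1 = 3.
print(average_space([12, 15, 9]))  # 12.0
```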
There are generally two types of texts. One is
aligned only at the left, another is justified at
both the left and the right. Our experiments show
that space marking on text which is justified at
both sides is more noticeable than the marking
on text aligned only at the left. For this reason,
the experiments in this paper focus on text which
is justified at both sides.
Figure 2 shows a typical profile of $S_a$ with
respect to text lines in a paragraph. It can be seen
that $S_a$ varies randomly across different lines.
III. Space Marking
Due to the random nature of average spaces of
text lines across a text document, we can define a
discrete random variable (rv) $X(n)$ as follows:
$$X(n) = S_{an}, \quad n = 0, 1, \ldots, N - 1 \tag{2}$$
where $n$ represents the index of a text line in a
text document with $N$ lines, and $S_{an}$ represents
$S_a$ of line $n$. Our space marking method can
therefore also be viewed as marking on the rv $X(n)$.
A space varying sine wave, or, more
specifically a sine wave which varies over a
number of text lines, has some attractive
characteristics for watermarking interword space
as follows:
1) A sine wave varies gradually so local
variation may be unnoticed.
2) A sine wave’s amplitude, frequency and
phase can be used to carry coding
information.
3) A sine wave’s periodic symmetry may make
decoding easy and reliable.
Thus, we can use different text lines across a
text document to encode a sine wave. More
specifically, the values of $S_a$ for different text
lines, or their variations, can act as sampling
values of a sine wave.
For space watermarking to be unnoticeable,
changes in interword spaces have to be kept to a
minimum. On the other hand, there should be
enough space modification so that marking can
be correctly detected. These conflicting
requirements suggest a narrow range of marking
amplitude.
The Sampling Theorem suggests that for a
sine wave to be reconstructed correctly, the
sampling frequency has to be at least twice that
of the sine wave. From the viewpoint of human
visual perception [10], there exist certain
frequencies at which variations are most
noticeable. It is reasonable to avoid marking in
the neighborhood of these frequencies. So the
capacity of a sine wave’s frequency to carry
watermarking information is limited.
Thus, the phase of a sine wave has been
chosen primarily to carry information in space
watermarking.
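The sampling constraint can be made concrete with a short Python sketch. It assumes one sample per text line and that a "half wave" of the marking sine spans exactly a given number of lines, so that the radian frequency is π divided by that number; this equality and both function names are illustrative assumptions rather than definitions from the paper.

```python
import math


def radian_frequency(half_wave_points):
    """Radian frequency (radians per text line) whose half period spans
    half_wave_points lines. Assumes one sample per text line."""
    if half_wave_points < 1:
        raise ValueError("need at least one sampling point per half wave")
    return math.pi / half_wave_points


def meets_sampling_condition(omega):
    """With one sample per line the sampling frequency is 2*pi radians per line,
    so the condition 'at least twice the wave frequency' reads omega <= pi."""
    return 0 < omega <= math.pi


omega = radian_frequency(5)
print(meets_sampling_condition(omega))   # True
print(round(2 * math.pi / omega))        # 10 text lines per full period
```

Under these assumptions, fewer sampling points per half wave mean a higher frequency and a shorter stretch of text needed to carry one period.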
In comparison to merely shifting words
horizontally in word-shift coding, the
modification of the interword space of a text line
involves the words in this line being horizontally
expanded or shrunk so that the required total
interword space, and thus $S_a$ of the line, is
obtained.
Suppose a new average space $S_a'$ is to be used after the modification of the interword space of a text line. Then, the change of the total interword space of this text line in pixels is
$$S_{tc} = (S_a' - S_a)(d - 1) \tag{3}$$
where $d$ is the number of words, and $S_a$ is the original average space of the text line as in (1).
If $S_{tc} > 0$, then the total interword space of this text line will expand and the words in this text line will shrink. If $S_{tc} < 0$, the total interword space of this text line will shrink and the words in this text line will expand.
For any word in this text line with an index $i$, suppose its width before the modification is $Pxl_i$ in pixels, then the expansion or shrinkage of width distributed to this word is
$$ES_i = \left[ \frac{S_{tc}}{\sum_{j=1}^{d} Pxl_j} \, Pxl_i \right], \quad \text{if } S_{tc} > 0 \tag{4.a}$$
or
$$ES_i = \left[ \frac{S_{tc}}{\sum_{j=1}^{d} Pxl_j} \, Pxl_i \right], \quad \text{if } S_{tc} < 0 \tag{4.b}$$
$ES_i$ is rounded to an integer (written $[\cdot]$ here), since it represents a number of pixels. Therefore, a difference may exist between $S_{tc}$ and the sum of $ES_i$, which is
$$S_d = S_{tc} - \sum_{i=1}^{d} ES_i \tag{5}$$
In our implementation, the difference $S_d$ is added to the largest $ES_i$.
To expand or shrink a word, vertical lines of equal intervals in the word are duplicated or removed respectively. The interval is calculated as
$$Iv_i = \left[ \frac{Pxl_i}{ES_i} \right] \tag{6}$$
Again, the interval $Iv_i$ is rounded to an integer. After the expansion or shrinkage of this word, it will have a new width in pixels, which is
$$Pxl_i' = Pxl_i - ES_i \tag{7}$$
The two sides of a text line are not changed while the line is being expanded or shrunk. To shrink or expand words, at the left half of this text line, the left side of each word is kept fixed and the word is shrunk or expanded, i.e. vertical lines of equal intervals are removed or duplicated; at the right half of this text line, the right side of each word is kept fixed and the word is shrunk or expanded.
A single text page or several pages of a text document are considered to be a workplace. Relevant text lines in this workplace are considered as sampling points of the sine wave for watermarking. Phase information can be either the absolute phase in this workplace or the relative phase if several waves are involved.
For the proposed space marking method, we have implemented both private and public watermarking algorithms.
A. Private Watermarking
1. First, the mean of $S_a$ is calculated
$$a_1 = \frac{\sum_{n=p}^{q} S_{an}}{q - p + 1}, \quad 0 \le p < q < N \tag{8}$$
where $p$ and $q$ are the indices of the first and last text lines in the workplace between which a watermarking sine wave will reside.
2. Then for each line, a watermark component is determined by the sine wave
$$W_n = C_1 a_1 \sin(\omega_1 (n - p) + \phi_1) \tag{9}$$
where $W_n$ represents the desired watermark component of a text line with an index of $n$; $\omega_1$ and $\phi_1$ are the radian frequency and initial phase angle of the sine wave respectively. $C_1$ is a constant determining the amplitude of the sine wave.
3. Next, $W_n$ is added to $S_a$ for line $n$, thus generating a new average space which is
$$S_{an}' = S_{an} + W_n \tag{10}$$
4. Finally, the words in each text line are modified accordingly by applying formulas (3) ~ (7).
The private method can be considered as adding a constant part to the original rv $X(n)$, thus generating a new rv $Y(n)$
$$Y(n) = X(n) + W_n \tag{11}$$
B. Public Watermarking
Unlike private watermarking, after which neighboring text lines still have random values of $S_a$, the values of $S_a$ of the lines used in public watermarking should have a certain relationship so that they can directly act as the sampling values of a sine wave.
Our experiments show that it is inappropriate to take all lines in a text document for public watermarking because of the degree of variation in $S_a$ of the original text lines. Observation of the profiles of $S_a$ suggests text lines with a larger number of words have closer values of $S_a$. This can be attributed to two reasons. First, in a text line with a larger number of words, an average word and its associated space is allocated a smaller number of pixels. Thus, the difference between $S_a$ of similar lines is smaller. Second, a text line with a larger number of words is less likely to be justified, or is justified to a smaller degree.
1. Based on the above observation, first a key is chosen in a text so that all text lines whose number of words is larger than, or equal to, the key are watermarked.
2. Then a set $S_w$ of text lines is selected from the text so that the number of words from each line in this set is not less than the selected key.
3. Then the mean of $S_a$ of text lines in set $S_w$ is calculated
$$a_2 = \frac{\sum_{m=u}^{v} S_{am}}{v - u + 1}, \quad 0 \le u < v < N \tag{12}$$
where $u$ and $v$ are similar to $p$ and $q$ in (8), but $u$ and $v$ are the indices of the text lines within $S_w$, rather than the indices in the original text in our implementation; $m$ is the index of a text line within $S_w$, and $S_{am}$ represents $S_a$ of line $m$.
4. For each text line in $S_w$, a watermark component is determined by a sine wave.
$$W_m = C_2 a_2 \sin(\omega_2 (m - u) + \phi_2) \tag{13}$$
where $W_m$ represents the desired watermark component of text line $m$; $\omega_2$ and $\phi_2$ are the radian frequency and initial phase angle of the sine wave respectively.
5. Next, for each line in $S_w$, $S_a$ is replaced with the sum of $a_2$ and $W_m$, thus generating a new average space which is
$$S_{am}' = a_2 + W_m, \quad \text{if line } m \in S_w, \text{ otherwise unchanged} \tag{14}$$
6. Finally, the words in each text line are modified accordingly by applying formulas (3) ~ (7).
Thus, for text lines belonging to $S_w$, we have a new rv $Y(n)$
$$Y(m) = a_2 + W_m, \quad \text{if line } m \in S_w, \text{ otherwise unchanged} \tag{15}$$
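The numeric side of the embedding in this section, from the sine-wave targets of (8) ~ (14) down to the per-word width changes of (3) ~ (5), can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the function names, the list-based inputs, nearest-integer rounding in (4), and the choice of "largest" as largest in magnitude are all assumptions. Pixel-level resampling of words, as in (6), is not shown.

```python
import math


def private_targets(s_a, p, q, c1, omega1, phi1):
    """Eqs. (8)-(10): add a sine component to S_a of lines p..q (inclusive)."""
    a1 = sum(s_a[p:q + 1]) / (q - p + 1)                     # (8)
    out = list(s_a)
    for n in range(p, q + 1):
        w_n = c1 * a1 * math.sin(omega1 * (n - p) + phi1)    # (9)
        out[n] = s_a[n] + w_n                                # (10)
    return out


def public_targets(s_a, word_counts, key, c2, omega2, phi2):
    """Eqs. (12)-(14): lines with at least `key` words form S_w and are
    set to a2 + W_m; all other lines are left unchanged."""
    selected = [n for n, d in enumerate(word_counts) if d >= key]
    if not selected:
        raise ValueError("no text line has enough words for this key")
    a2 = sum(s_a[n] for n in selected) / len(selected)       # (12)
    out = list(s_a)
    for m, n in enumerate(selected):      # m counts within S_w, so u = 0 here
        w_m = c2 * a2 * math.sin(omega2 * m + phi2)          # (13)
        out[n] = a2 + w_m                                    # (14)
    return out


def word_width_changes(s_a_old, s_a_new, word_widths):
    """Eqs. (3)-(5): split the change in total interword space over the words.

    Returns ES_i per word; the new word width is Pxl_i - ES_i as in (7).
    """
    d = len(word_widths)
    s_tc = round((s_a_new - s_a_old) * (d - 1))              # (3), whole pixels
    total = sum(word_widths)
    # Assumption: nearest-integer rounding (Python rounds halves to even).
    es = [round(s_tc * w / total) for w in word_widths]      # (4)
    s_d = s_tc - sum(es)                                     # (5)
    # Assumption: "largest ES_i" is taken as largest in magnitude.
    k = max(range(d), key=lambda i: abs(es[i]))
    es[k] += s_d
    return es


marked = private_targets([10.0] * 4, 0, 3, c1=0.1, omega1=math.pi / 2, phi1=0.0)
print([round(v, 6) for v in marked])                 # [10.0, 11.0, 10.0, 9.0]
print(word_width_changes(10, 12.5, [40, 40, 40]))    # [1, 2, 2]
```

The second call shows the residual of (5) at work: each word is first assigned 2 pixels, which overshoots the required 5 pixels by one, so one pixel is taken back from a single word.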
IV. Detection and Performance
When a text has been watermarked using the private method, we can obtain rv $Y(n)$ by a reconstruction of $S_a$ as in formula (1). With the original unmarked text, we have the watermark component $W_n$ from (11)
$$W_n = Y(n) - X(n)$$
Figure 3 shows a reconstructed profile of $S_a$ and detected watermarking information of the text of Figure 2 after it has been subjected to private watermarking. Interword spaces were only expanded for this example.
When a text has been watermarked using the public method, and supposing the watermarking key is known, so that the set $S_w$ of the text lines can be reconstructed and the mean of the average spaces $a_2$ recalculated as in (12), we also have the watermark component $W_m$ from (15)
$$W_m = Y(m) - a_2, \quad \text{for text lines in } S_w$$
Next, the originally marked phase information can be detected by calculating the cross-correlation of a detecting sine wave with $W_n$ or $W_m$
$$r(j) = \frac{1}{T} \sum_{n=0}^{T-1} W(n) A_d \sin(\omega_d n + j) \tag{16}$$
where $W$ represents $W_n$ or $W_m$; $\omega_d$ is the radian frequency of the detecting sine wave; and $j$ represents a lag in the number of text lines and varies so as to detect the marked phase information. Through the $j$ that produces an extreme value of $r(j)$, the original marked phase information can be recovered. $A_d$ is the amplitude of the detecting sine wave. $T$ is the summation number, which depends on the number of items in $W_n$ or $W_m$ as well as $\omega_d$ [11].
One parameter that has been used in our tests is ‘half wave sampling points’, being a number of sampling points $N$ in the encoding sine wave of (9) or (13) such that $0 \le N\omega \le \pi$, where $\omega$ is $\omega_1$ or $\omega_2$.
Our test results are shown in Tables 1~3.

Half Wave Sampling Points    10       7        5        3
Correct Rate                 20/20    20/20    20/20    20/20
Table 1. Detection results of private watermarking.

Half Wave Sampling Points    7        6        5        3
Correct Rate                 14/15    15/15    14/15    21/23
Table 2. Detection results of public watermarking.

Half Wave Sampling Points    6        3
Down Sampling                15/15    15/15
Skewing                      15/15    15/15
Table 3. Detection results after watermarked texts have been edited.

From our experiments, it can be concluded that space in text documents can be watermarked unnoticeably and the watermarks can be correctly detected, even after the watermarked texts have been subjected to editing operations.
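The correlation detection of (16) can be sketched in Python as below. It is a hedged illustration only: the function name and the `steps` grid are hypothetical, and the sketch treats j as a phase offset in radians scanned over one full turn, which is one possible reading of the lag described in the text.

```python
import math


def detect_phase(w, omega_d, a_d=1.0, steps=360):
    """Scan candidate offsets j and return (j, r(j)) for the largest r(j).

    r(j) = (1/T) * sum_{n=0}^{T-1} w[n] * a_d * sin(omega_d * n + j), as in (16),
    with T = len(w). Assumption: j is a phase offset in radians on a uniform
    grid of `steps` values in [0, 2*pi).
    """
    t = len(w)
    best_j, best_r = 0.0, -math.inf
    for k in range(steps):
        j = 2 * math.pi * k / steps
        r = sum(w[n] * a_d * math.sin(omega_d * n + j) for n in range(t)) / t
        if r > best_r:
            best_j, best_r = j, r
    return best_j, best_r


# Synthetic watermark component: phase 1.0 rad, 5 half wave sampling points,
# 20 lines (two full periods of the sine wave).
omega = math.pi / 5
w = [math.sin(omega * n + 1.0) for n in range(20)]
j_hat, r_max = detect_phase(w, omega)
print(abs(j_hat - 1.0) < 0.02)   # True
```

In this synthetic case the sum runs over whole periods, so r(j) reduces to half the cosine of the phase error and peaks at the marked phase. With real text the residual variation of the average spaces would add noise to r(j).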
V. Conclusion
A distinct and unique characteristic of a text document is its space pattern. We have developed new algorithms to digitally watermark text documents by utilizing their space. Our method slightly modifies interword spaces so that different lines across a text act as sampling points of a sine wave. Preliminary experiments have shown promising results. Compared to previously proposed text watermarking methods, our method can be implemented for both private and public watermarking. Furthermore, by embedding information on both horizontal and vertical directions, combined with utilizing averaging operations, it can be inherently robust against interference. Overall, our method is better than previously proposed methods in digital watermarking for text documents.
Currently, we are exploring if space patterning of a text document can be used in other ways for digital watermarking. The principle of sine wave coding, which is utilized in the experiments for this paper, may also be useful for digital watermarking in the frequency domain for general grayscale picture and video images.
References
[1] R. Anderson, Ed. “Proceedings of the First International Information Hiding Workshop, Cambridge, U.K., May/June, 1996”, vol. 1174 of Lecture Notes in Computer Science, Berlin, Springer-Verlag, 1996.
[2] D. Aucsmith, Ed. “Proceedings of the Second International Information Hiding Workshop, Portland, Oregon, USA, April 1998”, vol. 1525 of Lecture Notes in Computer Science, Berlin, Springer-Verlag, 1998.
[3] J. Brassil, S. Low, N. Maxemchuk, and L. O’Gorman, “Electronic Marking and Identification Techniques to Discourage Document Copying”, IEEE Journal on Selected Areas in Communications, vol. 13, no. 8, pp. 1495-1504, October 1995.
[4] J. Brassil, L. O’Gorman, “Watermarking Document Images with Bounding Box Expansion”, in Anderson [1], pp. 227-235.
[5] Special Issue on Copyright and Privacy Protection, IEEE Journal on Selected Areas in Communications, vol. 16, no. 4, May 1998.
[6] S. Katzenbeisser, F. A. P. Petitcolas, Eds. “Information Hiding Techniques for Steganography and Digital Watermarking”, Boston, Artech House, 2000.
[7] S. H. Low, N. F. Maxemchuk, “Performance Comparison of Two Text Marking Methods”, in special issue [5], pp. 561-572.
[8] S. H. Low, N. F. Maxemchuk, J. T. Brassil, and L. O’Gorman, “Document Marking and Identification Using Both Line and Word Shifting”, Proc. Infocom’95, Boston, MA, April 1995, pp. 853-860.
[9] S. H. Low, N. F. Maxemchuk, A. M. Lapone, “Document Identification for Copyright Protection Using Centroid Detection”, IEEE Transactions on Communications, vol. 46, no. 3, pp. 372-383, March 1998.
[10] T. N. Cornsweet, “Visual Perception”, New York, Academic Press, 1970, pp. 330-342.
[11] E. C. Ifeachor, B. W. Jervis, “Digital Signal Processing: A Practical Approach”, Reading, Mass., Addison-Wesley, 1993.
[12] S. Craver, N. Memon, B. Yeo, M. Yeung, “Resolving Rightful Ownerships with Invisible Watermarking Techniques: Limitations, Attacks, and Implications”, in special issue [5], pp. 573-586.
[Figure 1 plot: vertical axis in pixels, 0 to 30; horizontal axis from 1 to 271.]
Figure 1. Vertical profile of 5 words (original resolution is 300 pixels/inch, re-sampled at intervals of 3 pixels horizontally).
[Figure 2 plot: top curve is average space (pixel), bottom curve is word number; vertical axis 0 to 50; horizontal axis is text line, 1 to 23.]
Figure 2. Profile of average space for text lines in a paragraph (resolution: 300 pixels/inch).
[Figure 3 plot: top curve is reconstructed average space, bottom curve is detected watermark information (pixel); vertical axis 0 to 55; horizontal axis is text line, 1 to 23.]
Figure 3. Reconstructed profile of average space and detected watermark information after the text of Figure 2 has been subjected to private watermarking (resolution: 300 pixels/inch).