Interword Distance Changes Represented by Sine Waves for Watermarking Text Images

Ding Huang and Hong Yan
School of Electrical and Information Engineering
University of Sydney, NSW 2006
Email: hding@ee.usyd.edu.au, yan@ee.usyd.edu.au

Abstract

Digital watermarking is widely believed to be a valid means to discourage illicit distribution of information content. Digital watermarking methods for text documents are limited because of the binary nature of text documents. A distinct feature of a text document is its space patterning. In this paper we propose a new approach in text watermarking where interword spaces of different text lines are slightly modified. After the modification, the average spaces of various lines have the characteristics of a sine wave and the wave constitutes a mark. Both private and public watermarking algorithms are discussed in this paper. Preliminary experiments have shown promising results. Our experiments suggest space patterning of text documents can be a useful tool in digital watermarking.

Index Terms

Copyright protection, correlation detection, image processing, space patterning, text watermarking, wave coding.

I. Introduction

With the widespread use of the Internet in our society, the distribution and access of information is greatly facilitated. However, without methods which prevent or discourage illicit redistribution and reproduction of information content, copyright can be easily infringed. Digital watermarking is widely believed to be a valid solution to the problem and currently there is intensive research in this area, in both academic and industrial communities.

Compared to the plurality of previously proposed methods in digital watermarking for picture and video images, digital watermarking methods for text documents are very limited. One reason for this difference is that text is a binary image and lacks rich grayscale information. Digital watermarking for text documents is discussed in [3], [4], and [6]-[9].
There are primarily three types of text watermarking methods which have been developed previously:
(a) Line-Shift Coding: vertically shifts the locations of text lines to encode the document.
(b) Word-Shift Coding: horizontally shifts the locations of words within text lines to encode the document.
(c) Feature Coding: chooses certain text features and alters those features.
These three methods require the original unmarked text for decoding. Furthermore, since there are already variations in the original unmarked text, these methods need to establish some measurement basis to differentiate watermarks from the original variation.

In this paper, we propose a new text watermarking method. It makes use of a distinct characteristic of a text document, its space, and more specifically the interword spaces of text lines, to watermark the document. Our encoding technique adjusts interword spaces in a text document so that mean spaces across different lines show characteristics of a sine wave, and information can be encoded in the sine wave(s). Since the watermark is embedded in both horizontal and vertical directions, the method can be inherently robust against interference. Furthermore, information can be recovered either with or without the original image, and control lines or control blocks are not needed for decoding. We propose to call our new watermarking method Space Coding.

In Section II, we analyze the characteristics of space in a text document and show that average spaces of text lines are random. We define a random variable for average space in Section III and propose to use a sine wave to modify average spaces for encoding information. In Section IV, we describe watermarking detection methods and present some test results. Finally we present concluding remarks in Section V and suggest future work.

II. Space Profile and Statistics

A text page in digital form can be represented by the function

$$f(x, y) \in [0, 1], \quad x = 0, 1, \ldots, W, \quad y = 0, 1, \ldots, L$$

that represents black and white pixels. Here, $W$ and $L$ are the width and length of the page in pixels respectively. In digital image processing, interword space is detected with the vertical projection profile

$$v(x) = \sum_{y=t}^{b} f(x, y)$$

which is a summation of the "on" pixels along a vertical column from $t$ (top) to $b$ (bottom) of a text line. If there is no "on" pixel for a consecutive run of $x$,

$$v(x) = 0, \quad x = k, k+1, \ldots, k+c$$

then interword space is detected. Figure 1 shows a typical vertical profile of five words.

Intuitively, the average space of a text line can be used as a parameter to study the characteristics of space patterning in a text document. For a line with $d$ words, we have an average space

$$S_a = \frac{S_t}{d - 1}, \quad \text{where } d > 1 \tag{1}$$

$S_t$ is the total interword space in a text line calculated in pixels.

There are generally two types of texts. One is aligned only at the left; the other is justified at both the left and the right. Our experiments show that space marking on text which is justified at both sides is more noticeable than the marking on text aligned only at the left. For this reason, the experiments in this paper focus on text which is justified at both sides.

Figure 2 shows a typical profile of $S_a$ with respect to text lines in a paragraph. It can be seen that $S_a$ varies randomly across different lines.

III. Space Marking

Due to the random nature of average spaces of text lines across a text document, we can define a discrete random variable (rv) $X(n)$ as follows:

$$X(n) = S_{an}, \quad n = 0, 1, \ldots, N-1 \tag{2}$$

where $n$ represents the index of a text line in a text document with $N$ lines, and $S_{an}$ represents $S_a$ of line $n$. Our space marking method can therefore also be viewed as marking on the rv $X(n)$.
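As an illustration of the space profile of Section II, the following minimal sketch computes the vertical projection profile $v(x)$ of one text line and the average space $S_a$ of (1). It is not the authors' implementation: the function names, the representation of a line as a list of 0/1 pixel rows, and the threshold `min_run` (used here to separate interword gaps from narrower intercharacter gaps) are assumptions made for the example.

```python
def vertical_profile(line_img):
    """v(x): number of 'on' pixels in each column of a binary text-line image.

    line_img is a list of rows, each row a list of 0/1 pixels (assumed layout).
    """
    return [sum(column) for column in zip(*line_img)]


def interword_gaps(profile, min_run):
    """Widths of zero-runs of at least min_run columns that lie between ink.

    Runs before the first ink column and after the last one are margins
    and are not counted as interword space.
    """
    gaps = []
    run = 0
    seen_ink = False
    for v in profile:
        if v == 0:
            run += 1
        else:
            if seen_ink and run >= min_run:
                gaps.append(run)
            seen_ink = True
            run = 0
    return gaps


def average_space(line_img, min_run=3):
    """S_a = S_t / (d - 1) as in (1); returns None for a single-word line."""
    gaps = interword_gaps(vertical_profile(line_img), min_run)
    d = len(gaps) + 1          # number of words on the line
    if d <= 1:
        return None
    return sum(gaps) / (d - 1)  # S_t is the sum of the detected gaps
```

For a toy line whose interior zero-runs are 1, 4, and 6 columns wide, `min_run=3` treats the 1-column run as an intercharacter gap, so the line has three words and $S_a = (4 + 6)/2 = 5$ pixels.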
A space-varying sine wave, or more specifically a sine wave which varies over a number of text lines, has some attractive characteristics for watermarking interword space:
1) A sine wave varies gradually, so local variation may go unnoticed.
2) A sine wave's amplitude, frequency and phase can be used to carry coding information.
3) A sine wave's periodic symmetry may make decoding easy and reliable.
Thus, we can use different text lines across a text document to encode a sine wave. More specifically, the values of $S_a$ for different text lines, or their variations, can act as sampling values of a sine wave.

For space watermarking to be unnoticeable, changes in interword spaces have to be kept to a minimum. On the other hand, there should be enough space modification so that the marking can be correctly detected. These conflicting requirements leave only a narrow range of usable marking amplitudes. The Sampling Theorem suggests that for a sine wave to be reconstructed correctly, the sampling frequency has to be at least twice that of the sine wave. From the viewpoint of human visual perception [10], there exist certain frequencies at which variations are most noticeable. It is reasonable to avoid marking in the neighborhood of these frequencies. So the capacity of a sine wave's frequency to carry watermarking information is also limited. Thus, the phase of a sine wave has been chosen as the primary carrier of information in space watermarking.

In comparison to merely shifting words horizontally in word-shift coding, the modification of the interword space of a text line involves the words in this line being horizontally expanded or shrunk so that the required total interword space, and thus $S_a$ of the line, is obtained. Suppose a new average space $S_a'$ is to be used after the modification of the interword space of a text line.
Then, the change of the total interword space of this text line in pixels is

$$S_{tc} = (S_a' - S_a)(d - 1) \tag{3}$$

where $d$ is the number of words, and $S_a$ is the original average space of the text line as in (1). If $S_{tc} > 0$, then the total interword space of this text line will expand and the words in this text line will shrink. If $S_{tc} < 0$, the total interword space of this text line will shrink and the words in this text line will expand.

For any word in this text line with an index $i$, suppose its width before the modification is $Pxl_i$ in pixels; then the expansion or shrinkage of width distributed to this word is

$$ES_i = \frac{S_{tc}}{\sum_{j=1}^{d} Pxl_j}\, Pxl_i, \quad \text{if } S_{tc} > 0 \tag{4.a}$$

or

$$ES_i = \frac{S_{tc}}{\sum_{j=1}^{d} Pxl_j}\, Pxl_i, \quad \text{if } S_{tc} < 0 \tag{4.b}$$

$ES_i$ is rounded to an integer, since it represents a number of pixels. Therefore, a difference may exist between $S_{tc}$ and the sum of $ES_i$, which is

$$S_d = S_{tc} - \sum_{i=1}^{d} ES_i \tag{5}$$

In our implementation, the difference $S_d$ is added to the largest $ES_i$.

To expand or shrink a word, vertical lines of equal intervals in the word are duplicated or removed respectively. The interval is calculated as

$$Iv_i = \frac{Pxl_i}{ES_i} \tag{6}$$

Again, the interval $Iv_i$ is rounded to an integer. After the expansion or shrinkage of this word, it will have a new width in pixels, which is

$$Pxl_i' = Pxl_i - ES_i \tag{7}$$

The two sides of a text line are not changed while the line is being expanded or shrunk. To shrink or expand words at the left half of this text line, the left side of each word is kept fixed and the word is shrunk or expanded, i.e. vertical lines of equal intervals are removed or duplicated; at the right half of this text line, the right side of each word is kept fixed and the word is shrunk or expanded.

A single text page or several pages of a text document are considered to be a workplace. Relevant text lines in this workplace are considered as sampling points of the sine wave for watermarking. Phase information can be either the absolute phase in this workplace or the relative phase if several waves are involved.

For the proposed space marking method, we have implemented both private and public watermarking algorithms.

A. Private Watermarking

1. First, the mean of $S_a$ is calculated

$$a_1 = \frac{\sum_{n=p}^{q} S_{an}}{q - p + 1}, \quad 0 \le p < q < N \tag{8}$$

where $p$ and $q$ are the indices of the first and last text lines in the workplace between which a watermarking sine wave will reside.

2. Then for each line, a watermark component is determined by the sine wave

$$W_n = C_1 a_1 \sin(\omega_1 (n - p) + \phi_1) \tag{9}$$

where $W_n$ represents the desired watermark component of a text line with an index of $n$; $\omega_1$ and $\phi_1$ are the radian frequency and initial phase angle of the sine wave respectively. $C_1$ is a constant determining the amplitude of the sine wave.

3. Next, $W_n$ is added to $S_a$ for line $n$, thus generating a new average space which is

$$S_{an}' = S_{an} + W_n \tag{10}$$

4. Finally, the words in each text line are modified accordingly by applying formulas (3) to (7).

The private method can be considered as adding a fixed, deterministic component to the original rv $X(n)$, thus generating a new rv $Y(n)$

$$Y(n) = X(n) + W_n \tag{11}$$

B. Public Watermarking

Unlike private watermarking, after which neighboring text lines still have random values of $S_a$, the values of $S_a$ of the lines used in public watermarking should have a certain relationship so that they can directly act as the sampling values of a sine wave.

Our experiments show that it is inappropriate to take all lines in a text document for public watermarking because of the degree of variation in $S_a$ of the original text lines. Observation of the profiles of $S_a$ suggests text lines with a larger number of words have closer values of $S_a$. This can be attributed to two reasons. First, in a text line with a larger number of words, an average word and its associated space is allocated a smaller number of pixels. Thus, the difference between $S_a$ of similar lines is smaller. Second, a text line with a larger number of words is less likely to be justified, or is justified to a smaller degree.

1. Based on the above observation, first a key is chosen in a text so that all text lines whose number of words is larger than, or equal to, the key are watermarked.

2. Then a set $S_w$ of text lines is selected from the text so that the number of words in each line in this set is not less than the selected key.

3. Then the mean of $S_a$ of text lines in set $S_w$ is calculated

$$a_2 = \frac{\sum_{m=u}^{v} S_{am}}{v - u + 1}, \quad 0 \le u < v < N \tag{12}$$

where $u$ and $v$ are similar to $p$ and $q$ in (8), but in our implementation $u$ and $v$ are the indices of the text lines within $S_w$, rather than the indices in the original text; $m$ is the index of a text line within $S_w$, and $S_{am}$ represents $S_a$ of line $m$.

4. For each text line in $S_w$, a watermark component is determined by a sine wave

$$W_m = C_2 a_2 \sin(\omega_2 (m - u) + \phi_2) \tag{13}$$

where $W_m$ represents the desired watermark component of text line $m$; $\omega_2$ and $\phi_2$ are the radian frequency and initial phase angle of the sine wave respectively.

5. Next, for each line in $S_w$, $S_a$ is replaced with the sum of $a_2$ and $W_m$, thus generating a new average space which is

$$S_{am}' = a_2 + W_m, \quad \text{if line } m \in S_w, \text{ otherwise unchanged} \tag{14}$$

6. Finally, the words in each text line are modified accordingly by applying formulas (3) to (7).

Thus, for text lines belonging to $S_w$, we have a new rv $Y(m)$

$$Y(m) = a_2 + W_m, \quad \text{if line } m \in S_w, \text{ otherwise unchanged} \tag{15}$$

IV. Detection and Performance

When a text has been watermarked using the private method, we can obtain rv $Y(n)$ by a reconstruction of $S_a$ as in formula (1). With the original unmarked text, we have the watermark component $W_n$ from (11)

$$W_n = Y(n) - X(n)$$

Figure 3 shows a reconstructed profile of $S_a$ and detected watermarking information of the text of Figure 2 after it has been subjected to private watermarking. Interword spaces were only expanded for this example.

When a text has been watermarked using the public method, and supposing the watermarking key is known, so that the set $S_w$ of the text lines is reconstructed and the mean of the average spaces $a_2$ is recalculated as in (12), we also have the watermark component $W_m$ from (15)

$$W_m = Y(m) - a_2, \quad \text{for text lines in } S_w$$

Next, the originally marked phase information can be detected by calculating the cross-correlation of a detecting sine wave with $W_n$ or $W_m$

$$r(j) = \frac{1}{T} \sum_{n=0}^{T-1} W(n)\, A_d \sin(\omega_d (n + j)) \tag{16}$$

where $W$ represents $W_n$ or $W_m$; $\omega_d$ is the radian frequency of the detecting sine wave; and $j$ represents a lag in the number of text lines and varies so as to detect the marked phase information. $A_d$ is the amplitude of the detecting sine wave. $T$ is the summation number, which depends on the number of items in $W_n$ or $W_m$ as well as $\omega_d$ [11]. Through the $j$ that produces an extreme value of $r(j)$, the original marked phase information can be recovered.

One parameter that has been used in our tests is "half wave sampling points", being the number of sampling points $N$ in the encoding sine wave of (9) or (13) such that $0 \le \omega N \le \pi$, where $\omega$ is $\omega_1$ or $\omega_2$. Our test results are shown in Tables 1 to 3.

Half Wave Sampling Points    10       7        5        3
Correct Rate                 20/20    20/20    20/20    20/20
Table 1. Detection results of private watermarking.

Half Wave Sampling Points    7        6        5        3
Correct Rate                 14/15    15/15    14/15    21/23
Table 2. Detection results of public watermarking.

Half Wave Sampling Points    6        3
Down Sampling                15/15    15/15
Skewing                      15/15    15/15
Table 3. Detection results after watermarked texts have been edited.

From our experiments, it can be concluded that space in text documents can be watermarked unnoticeably and the watermarks can be correctly detected, even after the watermarked texts have been subjected to editing operations.

V. Conclusion

A distinct and unique characteristic of a text document is its space pattern. We have developed new algorithms to digitally watermark text documents by utilizing their space. Our method slightly modifies interword spaces so that different lines across a text act as sampling points of a sine wave.
Preliminary experiments have shown promising results. Compared to previously proposed text watermarking methods, our method can be implemented for both private and public watermarking. Furthermore, by embedding information in both horizontal and vertical directions, combined with utilizing averaging operations, it can be inherently robust against interference. Overall, our method is better than previously proposed methods in digital watermarking for text documents.

Currently, we are exploring if space patterning of a text document can be used in other ways for digital watermarking. The principle of sine wave coding, which is utilized in the experiments for this paper, may also be useful for digital watermarking in the frequency domain for general grayscale picture and video images.

References

[1] R. Anderson, Ed., "Proceedings of the First International Information Hiding Workshop, Cambridge, U.K., May/June, 1996", vol. 1174 of Lecture Notes in Computer Science, Berlin, Springer-Verlag, 1996.
[2] D. Aucsmith, Ed., "Proceedings of the Second International Information Hiding Workshop, Portland, Oregon, USA, April 1998", vol. 1525 of Lecture Notes in Computer Science, Berlin, Springer-Verlag, 1998.
[3] J. Brassil, S. Low, N. Maxemchuk, and L. O'Gorman, "Electronic Marking and Identification Techniques to Discourage Document Copying", IEEE Journal on Selected Areas in Communications, vol. 13, no. 8, pp. 1495-1504, October 1995.
[4] J. Brassil, L. O'Gorman, "Watermarking Document Images with Bounding Box Expansion", in Anderson [1], pp. 227-235.
[5] Special Issue on Copyright and Privacy Protection, IEEE Journal on Selected Areas in Communications, vol. 16, no. 4, May 1998.
[6] S. Katzenbeisser, F. A. P. Petitcolas, Eds., "Information Hiding Techniques for Steganography and Digital Watermarking", Boston, Artech House, 2000.
[7] S. H. Low, N. F. Maxemchuk, "Performance Comparison of Two Text Marking Methods", in special issue [5], pp. 561-572.
[8] S. H. Low, N. F. Maxemchuk, J. T. Brassil, and L. O'Gorman, "Document Marking and Identification Using Both Line and Word Shifting", Proc. Infocom '95, Boston, MA, April 1995, pp. 853-860.
[9] S. H. Low, N. F. Maxemchuk, A. M. Lapone, "Document Identification for Copyright Protection Using Centroid Detection", IEEE Transactions on Communications, vol. 46, no. 3, pp. 372-383, March 1998.
[10] T. N. Cornsweet, "Visual Perception", New York, Academic Press, 1970, pp. 330-342.
[11] E. C. Ifeachor, B. W. Jervis, "Digital Signal Processing: A Practical Approach", Reading, Mass., Addison-Wesley, 1993.
[12] S. Craver, N. Memon, B. Yeo, M. Yeung, "Resolving Rightful Ownerships with Invisible Watermarking Techniques: Limitations, Attacks, and Implications", in special issue [5], pp. 573-586.

Figure 1. Vertical profile of 5 words (original resolution is 300 pixels/inch, re-sampled at an interval of 3 pixels horizontally). Vertical axis: pixel.

Figure 2. Profile of average space for text lines in a paragraph (resolution: 300 pixels/inch). Top: average space (pixel). Bottom: word number. Horizontal axis: text line.

Figure 3. Reconstructed profile of average space and detected watermark information after the text of Figure 2 has been subjected to private watermarking (resolution: 300 pixels/inch). Top: reconstructed average space (pixel). Bottom: detected watermark information (pixel). Horizontal axis: text line.
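As a supplementary illustration of private watermarking, (8) to (11), and of correlation detection, (16), the sketch below works directly on a list of average spaces $S_a$. It is a simplified sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the word resizing of (3) to (7) and all rounding to whole pixels are omitted, and the detecting wave is taken as $\sin(\omega_d (n + j))$ with the lag $j$ counted in text lines.

```python
import math


def private_mark(s_a, c1, omega, phase, p=0, q=None):
    """Add a sine-wave component to the average spaces of lines p..q."""
    q = len(s_a) - 1 if q is None else q
    a1 = sum(s_a[p:q + 1]) / (q - p + 1)            # mean average space, (8)
    marked = list(s_a)
    for n in range(p, q + 1):
        w_n = c1 * a1 * math.sin(omega * (n - p) + phase)   # (9)
        marked[n] = s_a[n] + w_n                             # (10), (11)
    return marked


def correlate(w, omega_d, lag, a_d=1.0):
    """r(j) of (16): mean product of w with a detecting sine shifted by lag."""
    t = len(w)
    return sum(w[n] * a_d * math.sin(omega_d * (n + lag)) for n in range(t)) / t


def detect_lag(w, omega_d, max_lag):
    """Lag in 0..max_lag-1 that maximizes r(j)."""
    return max(range(max_lag), key=lambda j: correlate(w, omega_d, j))


if __name__ == "__main__":
    # Toy data (made up for the example): 24 lines with S_a near 30 pixels.
    original = [30 + ((7 * n) % 5) - 2 for n in range(24)]
    omega = 2 * math.pi / 8                     # 4 half wave sampling points
    marked = private_mark(original, c1=0.1, omega=omega, phase=3 * omega)
    # Private detection has the original text, so W_n = Y(n) - X(n).
    w = [y - x for x, y in zip(original, marked)]
    print(detect_lag(w, omega, max_lag=8))      # prints 3
```

When the summation covers whole periods of the wave, $r(j)$ is proportional to $\cos(\phi_1 - \omega_d j)$, so the maximizing lag recovers the marked phase in units of text lines. In practice the pixel rounding and the original variation of $S_a$ (for public detection) would add noise to $W$, which this sketch does not model.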