Savitribai Phule Pune University

PROJECT TITLE

A Seminar By
Student Name (BSCDS)

Submitted in partial fulfillment of the requirements for the degree of
(BSC Data Science)

November 27, 2023

Accepted by the University
Date
HEAD OF THE DEPARTMENT

Acknowledgements

It gives me immense pleasure to present the preliminary seminar report on “Hiding the Message behind the Words: Advances in Natural Language Watermarking”. I would like to express my deep sense of gratitude to the administrative and technical staff of SPPU for helping and supporting me. I owe thanks to my beloved family and friends for their kind co-operation and valuable help. Last but not least, I express my deep sense of gratitude to all my well-wishers.

Student Name (Roll No)

Abstract

Digital watermarking is a technique for hiding a message in a digital signal, and it is used to protect content. Watermarking is mostly applied to images, audio and video. There are many digital watermarking techniques, but this paper focuses on text. Text watermarking has been implemented for Chinese, English, Arabic and Turkish text by different methods. This paper develops a digital watermarking algorithm for English text, based on English grammatical words and an encryption process. The procedure mainly marks grammatical words such as verbs, conjunctions, prepositions and articles, and produces an encrypted message that serves as the watermark. The technique is applied to the content of different websites to verify it, for example https://louisem.com/1912/free-watermark-softwarewatermark-online.

Keywords: Text watermarking; Natural language processing; RSA; Encryption; Author’s authenticity

Contents

1 Introduction
  1.1 Project Overview
  1.2 Problem Statement
  1.3 Research Objective
2 Literature Review
  2.1 Related Documents
3 Digital Watermarking Technique
  3.1 Document Structure
    3.1.1 Space Coding
    3.1.2 Feature Coding
  3.2 Natural Language Processing
    3.2.1 Synonym substitution
    3.2.2 Syntactic transformation
    3.2.3 Semantic transformation
4 Methodology
  4.1 Mathematical Formula
5 Result
  5.1 Case study 1
  5.2 Case study 2

List of Figures

4.1 Create a security key using the algorithm

Chapter 1 Introduction

1.1 Project Overview

This is the age of technology, and we use the internet abundantly. When using the internet, we access large amounts of digital data. In the past, large amounts of data were not available digitally and existed only as hard copies. Nowadays digital data is more convenient than hard-copy data, because sharing, gathering and storing it is easier than with hard copies of the same data. Digital text data is more secure than hard copies, and it is also easier to change. However, digital data not only provides great convenience but is also a haven for malicious illegal attacks and piracy, and it creates copyright problems [1].
Besides this, digital text can be illegally copied and is susceptible to threats such as theft of important data, authentication problems and forgery. There are various effective ways to overcome these threats: measures that ensure confidentiality, authenticity and integrity can be used [2]. It is impossible to protect data against copyright infringement completely, but different methods can provide a degree of protection. Well-known watermarking techniques protect digital data against copyright infringement. Watermarking embeds some secret information into the original digital text; this information can later be extracted and compared to establish owner attribution and to track infringement [3]. Different kinds of digital watermarking methods have been developed recently. Some methods establish attribution of the owner using semantic or syntactic analysis, text format, line spacing, or character and word fonts [2]. Digital watermarking is mostly applied to images, audio, video and text. Watermarking is very useful for protecting data against illegal threats, copyright infringement and piracy.

In this paper, the algorithm works with part-of-speech tags, including modal verbs, prepositions, conjunctions and articles. The RSA algorithm is used to encrypt the message, which hides information such as the owner's name. The algorithm is also intended to identify, with high accuracy, the real author of the document from which text has been copied. It also provides protection against various threats to Word documents by modifying the font of the letters in the Word document. Finally, tests are used to explore and show the accuracy and efficiency of the algorithm.

1.2 Problem Statement

Copyright issues arise especially when data is in digital form. Once such data is published, its owners face copyright problems.
1.3 Research Objective

The main objective of this paper is to find a more effective digital watermarking technique to protect the owner's digital data.

Chapter 2 Literature Review

2.1 Related Documents

Makarand L. Mali, Nitin N. Patil and J. B. Patil [2] proposed a watermarking algorithm based on English grammatical words together with an encryption technique. The authors focused on grammatical classes such as conjunctions, pronouns and modal verbs to generate an encrypted watermark message.

Chen Li and You Fucheng [3] implemented a text digital watermarking algorithm based on the Word document after studying several text digital watermarking technologies. It embeds the watermark by modifying the font of the letters in the Word document.

Yingli Zhang et al. [6] proposed a watermarking method based on the Word document for controlling dissemination and protecting copyright, for both Chinese and English. The main idea of the paper is information hiding: each object of the Word document carries information about the author and the legal user after encryption.

C. Culnane et al. [7] suggested a watermarking method for formatted text documents. The method uses word spaces and treats the document as one long line for watermarking. The authors also proposed a method of thresholding and threshold buffering.

Xianghe Jing, Huaping Fei, Yu Hao and Zhijun Li [8] proposed a novel text encryption method based on natural language processing (NLP). Three linguistic transformations (synonym substitution, syntactic transformation and semantic transformation) are introduced and a new encryption technique is provided.

Daojing Li and Bo Zhang [9] implemented a Dual Watermarking scheme based on Threshold Cryptography (DWTC) for web information to address the problems of robustness and invisibility. Being based on threshold cryptography, the watermarking method can improve robustness.

Mercan Topkara et al.
[10] suggested natural language watermarking that modifies sentence composition to apply a watermark. Text components such as characters, words and lines were modified to carry the necessary information. The authors also review the state of efficiency in natural language watermarking and the tools and techniques for text processing.

Chapter 3 Digital Watermarking Technique

A digital watermark is a signal or string embedded in a noise-tolerant signal. It is usually used to identify ownership of the copyright of that signal. Researchers have developed different embedding algorithms, which fall into the categories described below.

3.1 Document Structure

3.1.1 Space Coding

Two types of space coding are mainly used: 1) line space coding and 2) word space coding.

1) Line space coding works with the space between two adjacent rows of a paragraph. The watermark is embedded by subtly changing the spacing of adjacent lines. Although this method is robust and the watermark is difficult to detect because the changes are hard to see, the capacity of the watermark is very small.

2) Word space coding works by shifting words horizontally, left or right within the same row. Invisible coding is also a form of space coding: watermark data is attached at the line break, where it is difficult to see whether a tab or a space ends the line.

3.1.2 Feature Coding

Feature coding changes text features such as font family, indent, color, text style and font style. Because of the variety of features that can carry information and the resulting watermark capacity, it is more widely used than space coding.

3.2 Natural Language Processing

3.2.1 Synonym substitution

Synonym substitution is the most popular and the simplest watermark embedding technique. Its distinguishing feature is that it does not change the meaning of the sentence: selected words are replaced with their synonyms.
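The idea of synonym substitution can be illustrated with a minimal sketch. It assumes a small, hypothetical table of synonym pairs in which the first word of a pair stands for bit 0 and the second for bit 1; the table, the function names and the sample sentence are illustrative and are not taken from any of the cited methods.

```python
# Hypothetical synonym pairs: the first word encodes bit 0, the second bit 1.
SYNONYM_PAIRS = [("big", "large"), ("quick", "fast"), ("begin", "start")]

# Map each word to (its pair, the bit it encodes).
LOOKUP = {word: (pair, bit) for pair in SYNONYM_PAIRS for bit, word in enumerate(pair)}


def embed(text, bits):
    """Write the bit string into the text by choosing one synonym per carrier word."""
    remaining = list(bits)
    out = []
    for word in text.split():
        entry = LOOKUP.get(word.lower())
        if entry and remaining:
            pair, _ = entry
            # Pick the synonym that encodes the next watermark bit.
            out.append(pair[int(remaining.pop(0))])
        else:
            out.append(word)
    if remaining:
        raise ValueError("not enough substitutable words to carry the watermark")
    return " ".join(out)


def extract(text, n_bits):
    """Read back the first n_bits bits from the synonym choices in the text."""
    found = [str(LOOKUP[w.lower()][1]) for w in text.split() if w.lower() in LOOKUP]
    return "".join(found[:n_bits])


marked = embed("the big dog made a quick start", "101")
print(marked)
print(extract(marked, 3))
```

The sketch splits on whitespace only, so a word with attached punctuation is not recognised and the capitalisation of a replaced word is lost; a practical system would also need a much larger synonym dictionary and a check that each substitution fits its context.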
3.2.2 Syntactic transformation

Syntactic transformation changes the syntactic structure of a sentence, making small, unobtrusive changes to it. Converting a sentence between active and passive voice, splitting a sentence, and placing the topic at the beginning of the sentence are some approaches to syntactic transformation.

3.2.3 Semantic transformation

Semantic transformation changes the representation of the data from one model to another using semantic information. Different semantic techniques are used for watermarking; replacing a word or phrase with another so that the sentence keeps the same meaning is one such approach.

Chapter 4 Methodology

Sharing information on the internet leads to illegal copying of data, which creates problems for the owners and writers of that data. The need for digital text watermarking to protect original authorship and copyright is therefore growing. To protect authorship and copyright while preserving the original form of the data, a robust watermarking algorithm is needed. In this paper, a robust algorithm is proposed for digital watermarking based on natural language processing techniques. The proposed algorithm works with part-of-speech (POS) tags. The method scrapes a web page and sums the total number of occurrences of modal verbs, prepositions, conjunctions and articles in the text document. It then converts the number of occurrences into binary and concatenates this binary number with the author’s ID. This ID is an approved ID, provided by the owner himself.

4.1 Mathematical Formula

Let
n(p) = total number of occurrences of prepositions in the text document,
n(c) = total number of occurrences of conjunctions in the text,
n(mv) = total number of occurrences of modal verbs in the text,
n(a) = total number of occurrences of articles in the text,
AuthID = author’s ID.
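The four counts defined above could be computed along the following lines. This is only a sketch: instead of a real POS tagger it uses small, hypothetical word lists for the four closed word classes, so ambiguous words (for example "to" or "for") are counted by list membership alone. The word lists and the function name are assumptions made for illustration.

```python
import re

# Hypothetical, deliberately short word lists for the four classes.
# A POS tagger would be the more faithful choice; this is a simplification.
WORD_CLASSES = {
    "p": {"in", "on", "at", "by", "for", "with", "from", "to", "of", "about", "into", "over", "under"},
    "c": {"and", "but", "or", "nor", "yet", "so", "because", "although"},
    "mv": {"can", "could", "may", "might", "must", "shall", "should", "will", "would"},
    "a": {"a", "an", "the"},
}


def count_function_words(text):
    """Return n(p), n(c), n(mv) and n(a) for the text as a dict keyed by class label."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = {label: 0 for label in WORD_CLASSES}
    for token in tokens:
        for label, words in WORD_CLASSES.items():
            if token in words:
                counts[label] += 1
    return counts


sample = "The author can publish a paper on the web, and a reader may copy it."
print(count_function_words(sample))
```

Because the counts depend on every preposition, conjunction, modal verb and article in the document, editing the text will usually change at least one of them, which is what lets the counts act as a fingerprint of the original content.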
Step-1

\[ \mathrm{Key} = \left\{ \sum_{\mathrm{length}=1}^{n} n(p) + n(c) + n(mv) + n(a) + \mathrm{AuthID} \right\} \tag{4.1} \]

Step-2

\[ \mathrm{Key} = (\mathrm{Key})_{\mathrm{binary}} \]

The RSA algorithm is applied to this combined key for final encryption to generate the watermark. The algorithm is implemented in Python.

Figure 4.1: Create a security key using the algorithm

Chapter 5 Result

The proposed algorithm was applied to the content of three web pages to generate watermarks. The results are shown below.

5.1 Case study 1

A web page is selected on the internet and scraped so that only its content remains. The algorithm is applied to this content and creates a unique key, which serves as the watermark; with this key, copyrighted content can be identified. This unique key is the result of case study 1. Figure 1 and Figure 2 show the process.

5.2 Case study 2

We use another web page for our experiment. Second web page: https://nytcrossword.com/tag/it-may-allow-a-textdocument-to-be-displayed-on-aweb-page-crosswordclue

Figure 5 and Figure 6 show the process of the algorithm. Figure 6 shows the encrypted key generated by applying our algorithm.

Chapter 6 Conclusion and Future Scope

This paper presents a watermarking technique for English text based on grammatical rules. The technique was applied to several web pages for secure watermarking. The encrypted watermark produced by the algorithm is more secure and helps to better protect authorship and copyright. The paper elaborates the algorithm, in which the occurrence count used for the watermark is converted to binary and then encrypted with the RSA algorithm to make a robust key. The technique is applied to English, which the author considers more compatible with it than other languages. The proposed algorithm is implemented in Python because Python is well suited to natural language processing. In the future, the author will try to extend this algorithm to other languages (especially Latin).
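As an illustration of Step-1, Step-2 and the RSA encryption described in Section 4.1, the following sketch builds a key from the four counts and an author ID and then encrypts it. It rests on several assumptions: the "+ AuthID" term is read as concatenation of binary strings, as the prose of Chapter 4 describes; the RSA parameters are hypothetical toy values that are far too small for real use; and textbook RSA without padding is used only to keep the example short.

```python
def build_key_bits(counts, auth_id):
    """Steps 1 and 2: sum the counts, convert to binary, append the author ID in binary."""
    total = sum(counts[label] for label in ("p", "c", "mv", "a"))
    # Assumption: AuthID is concatenated rather than added numerically.
    return format(total, "b") + format(auth_id, "b")


# Toy RSA parameters (hypothetical and insecure; real keys are 2048 bits or more).
P, Q, E = 61, 53, 17
N = P * Q
D = pow(E, -1, (P - 1) * (Q - 1))  # modular inverse, needs Python 3.8+


def rsa_encrypt(message):
    """Textbook RSA encryption of an integer smaller than N."""
    if not 0 <= message < N:
        raise ValueError("key does not fit the toy modulus")
    return pow(message, E, N)


def rsa_decrypt(cipher):
    """Textbook RSA decryption with the private exponent D."""
    return pow(cipher, D, N)


def make_watermark(counts, auth_id):
    """Encrypt the binary key, interpreted as an integer, to obtain the watermark."""
    return rsa_encrypt(int(build_key_bits(counts, auth_id), 2))


counts = {"p": 1, "c": 1, "mv": 2, "a": 4}  # illustrative counts
watermark = make_watermark(counts, auth_id=5)
print(build_key_bits(counts, 5), watermark, rsa_decrypt(watermark))
```

To check a suspect document, the verifier would recompute the counts, rebuild the key and compare it with the decrypted watermark. A production version would rely on a vetted cryptography library with proper padding instead of the hand-written functions shown here.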
Bibliography

[1] Wang Zhigang, Research of Watermarking Algorithm for WORD Document, China Science and Technology Information, Mar. 2010, pp. 114.

[2] Makarand L. Mali, Nitin N. Patil, J. B. Patil, Implementation of Text Watermarking Technique Using Natural Language Watermarks, 2013 International Conference on Communication Systems and Network Technologies.

[3] Chen Li, You Fucheng, The Study on Digital Watermarking Based on Word Document, 2013 International Conference on Mechatronic Sciences.

[4] hen Qing, Zhou Limin, The Research of Digital Watermarking Algorithm Based on WORD Document, Image Processing, 2010, pp. 271–350.

[6] Yingli Zhang, Huaiqing Qin, A Novel Robust Text Watermarking For Word Document, 3rd International Congress on Image and Signal Processing, Vol. 1, pp. 38-42, October 2010.

[7] C. Culnane, H. Treharne, and A.T.S. Ho, Improving Multi-Set Formatted Binary Text Watermarking Using Continuous Line Embedding, in Proceedings of IEEE International Conference on Innovative Computing, Information and Control (ICICIC-07), Kumamoto, Japan, pp. 287-29, 2007.

[8] Xianghe Jing, Yu Hao, Huaping Fei, Zhijun Li, Text Encryption Algorithm Based on Natural Language Processing, 2012 Fourth International Conference on Multimedia Information Networking and Security.

[9] Daojing Li and Bo Zhang, DWTC: A Dual Watermarking Scheme Based on Threshold Cryptography for Web Document, 2010 International Conference on Computer Application and System Modeling (ICCASM 2010).

[10] M. Topkara, New Designs for Improving the Efficiency and Resilience of Natural Language Watermarking, PhD Thesis, Purdue University, West Lafayette, Indiana, 2007.