Masters Computing Minor Thesis

Concept-based Tree Structure Representation for Paraphrased Plagiarism Detection

By Kiet Nim
nimhy003@mymail.unisa.edu.au

A thesis submitted for the degree of Master of Science (Computer and Information Science)
School of Computer and Information Science
University of South Australia
November 2012

Supervisor: Dr. Jixue Liu
Associate Supervisor: Dr. Jiuyong Li

Declaration

I declare that this thesis presents original work conducted by myself and does not incorporate without reference any material previously submitted for a degree in any university. To the best of my knowledge, the thesis does not contain any material previously published or written by another person except where due acknowledgement is made in the text.

Kiet Nim
November 2012

Acknowledgements

I would like to express my sincere gratitude to my supervisors, Dr. Jixue Liu and Dr. Jiuyong Li, professors and researchers at the University of South Australia, for their dedicated support, professional advice, feedback and encouragement throughout the period of conducting the study. In addition, I would like to thank all of my course coordinators for their dedicated and in-depth teaching. Finally, I would like to thank my family for always encouraging me and providing their full support throughout my study in Australia.

Abstract

In the era of the World Wide Web, searching for information can be performed easily with the support of numerous search engines and online databases. However, this also makes the task of protecting intellectual property from information abuse more difficult. Plagiarism is one such dishonest behavior. Most existing systems can efficiently detect literal plagiarism, where an exact copy is made or only minor changes are applied. In cases where plagiarists use intelligent methods to hide their intentions, these PD systems usually fail to detect plagiarized documents.
The concept-based tree structure representation is a potential solution for detecting paraphrased plagiarism, one of the intelligent plagiarism tactics. By exploiting WordNet as background knowledge, a concept-based feature can be generated. This additional feature, in combination with the traditional term-based feature and the term-based tree structure, can enhance document representation. In particular, the modified model not only captures syntactic information, as the term-based model does, but also discovers hidden semantic information in a document. Consequently, semantically similar documents can be detected and retrieved. The contributions of the modified structure are twofold. Firstly, a real-time prototype for high-level plagiarism detection is proposed in this study. Secondly, the additional concept-based feature provides considerable improvements for the task of Document Clustering, in that more semantically related documents can be grouped into the same clusters even though they are expressed in different ways. Consequently, the task of Document Retrieval can retrieve more relevant documents on the same topics.

Table of Contents

Declaration
Acknowledgements
Abstract
List of Figures
List of Tables
Chapter 1 – Introduction
1.1.
Background
1.2. Motivations
1.3. Fields of Thesis
1.4. Research Question
1.5. Contributions
Chapter 2 – Literature Review
2.1. Plagiarism Taxonomy
2.2. Document Representation
2.2.1. Flat Feature Representation
2.2.2. Structural Representation
2.3. Plagiarism Detection Techniques
2.4. Limitations
Chapter 3 – Methodology
3.1. Document Representation and Indexing
3.1.1. Term-based Vocabulary Construction
3.1.2. Concept-based Vocabulary Construction
3.1.3. Document Representation
3.1.4.
Document Indexing
3.2. Source Detection and Retrieval
3.3. Detail Plagiarism Analysis
3.3.1. Paragraph Level Plagiarism Analysis
3.3.2. Sentence Level Plagiarism Analysis
Chapter 4 – Experiments
4.1. Experiment Initialization
4.1.1. The Dataset and Workstation Configuration
4.1.2. Performance Measures and Parameter Configuration
4.2. Source Detection and Retrieval for Literal Plagiarism
4.3. Source Detection and Retrieval for Paraphrased Plagiarism
4.4. Study of Parameters
4.4.1. Size of the Term-based Vocabulary
4.4.2. Size of the Concept-based Vocabulary
4.4.3. Dimensions of the Term-based PCA feature
4.4.4. Dimensions of the Concept-based PCA feature
4.4.5. Contribution of the Weights w1 and w2
Chapter 5 – Conclusion and Future Works
References

List of Figures

Figure 1 - Taxonomy of Plagiarism
Figure 2 - Term-Document Matrix
Figure 3 - Singular Value Decomposition of term-document matrix A
Figure 4 - 3-layer document-paragraphs-sentences tree representation
Figure 5 - Comparison of the original & modified Porter Stemmers
Figure 6 - Data structure of the Term-based Vocabulary
Figure 7 - Example of looking for synonyms, hypernyms and hyponyms
Figure 8 - Data structure for the Concept-based Vocabulary
Figure 9 - Concept-based Document Tree Representation
Figure 10 - 2-level SOMs for the document-paragraph-sentence document tree
Figure 11 - Performance of Source Detection & Retrieval for Literal Plagiarism
Figure 12 - Performance of Source Detection & Retrieval for Paraphrased Plagiarism
Figure 13 - Performance based on different sizes of the Term-based Vocabulary
Figure 14 - Performance based on different sizes of the Concept-based Vocabulary
Figure 15 - Performance based on different dimensions of the Term-based PCA feature
Figure 16 - Performance based on different dimensions of the Concept-based PCA feature

List of Tables

Table 1 - Configuration of Parameters for Literal Plagiarism
Table 2 - Source Detection & Retrieval for Literal Plagiarism
Table 3 - Configuration of Parameters for Paraphrased Plagiarism
Table 4 - Source Detection & Retrieval for Paraphrased Plagiarism
Table 5 - Performance based on different sizes of the Term-based Vocabulary
Table 6 - Performance based on different sizes of the Concept-based Vocabulary
Table 7 - Performance based on different dimensions of the Term-based PCA feature
Table 8 - Performance based on different dimensions of the Concept-based PCA feature
Table 9 - Performance based on different values of w1 and w2

Chapter 1 – Introduction

1.1. Background

In the era of the World Wide Web, more and more documents are being digitized and made available for remote access. Searching for information has become even easier with the support of a variety of search engines and online databases. However, this also makes the task of protecting intellectual property from information abuse more difficult. One such dishonest behavior is plagiarism. It is clear that plagiarism has caused significant damage to intellectual property. Most cases are detected in academia, for example in student assignments and research works. Lukashenko et al. [1] define plagiarism as the activity of "turning of someone else's work as your own without reference to original source". Several systems and algorithms have been developed to tackle this problem. However, most of them can only detect word-by-word plagiarism, also referred to as literal plagiarism. These are cases where plagiarists make an exact copy or only minor changes to original sources. In cases where significant changes are made, most of these "flat feature" based methods fail to detect plagiarized documents [2]. This type is referred to as intellectual or intelligent plagiarism and includes text manipulation, translation and idea adoption.
In our research, we focus on improving an existing structural model and conducting several experiments to test the detection of one tactic of text manipulation: Paraphrasing.

1.2. Motivations

Paraphrasing is a strategy of intellectual plagiarism used to bypass systems that only detect exact copies or plagiarized documents with minor modifications. For instance, one popular and widely used system for plagiarism detection in academia is Turnitin. Turnitin can detect word-by-word copying efficiently down to sentence level. However, by simply paraphrasing detected terms using their synonyms/hyponyms/hypernyms or similar phrases, it can be bypassed easily. Paraphrasing is just one of many existing intelligent tactics, and plagiarism has clearly become more and more sophisticated [3]. Therefore, more powerful mechanisms are urgently needed to protect intellectual property from high-level plagiarism. In different plagiarism detection (PD) systems, documents are represented by different non-structural or structural schemes. Non-structural or flat-feature-based representations are the earliest mechanisms for document representation. Systems such as COPS [4] and SCAM [5] are typical applications of these schemes, where documents are broken into small chunks of words or sentences. These chunks are then hashed and registered against a hash table to perform document retrieval (DR) and PD. Systems in [6-8] use character or word n-grams as the units for similarity detection. All these flat-feature-based systems have in common that they ignore the contextual information of words/terms in a document. Structural representations have therefore recently been developed to overcome this limitation. In these schemes, two promising candidates which can capture rich textual information are graph [9, 10] and tree structure representations [2, 11-13].
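The chunk-hashing scheme used by systems such as COPS and SCAM, mentioned above, can be sketched as follows. This is a simplified illustration of the general idea (word n-grams hashed into a fingerprint set), not those systems' actual algorithms; all names are our own.

```python
import hashlib

def fingerprint(words, n=3):
    """Hash every word n-gram of a document into a set of fingerprints."""
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return {hashlib.md5(g.encode()).hexdigest() for g in grams}

def overlap(query_words, source_words, n=3):
    """Fraction of the query's n-gram fingerprints found in the source."""
    q, s = fingerprint(query_words, n), fingerprint(source_words, n)
    return len(q & s) / len(q) if q else 0.0
```

An exact copy yields an overlap of 1.0, while rewording even one word changes every n-gram covering it, which is why such schemes are easily defeated by sentence-level edits.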
The applications of structural representation have shown significant improvements in the tasks of DR, document clustering (DC) and PD. However, the majority of both non-structural and structural representation schemes are still based on word histograms or features derived from word histograms. They can effectively detect literal plagiarism but are not strong enough to perform intelligent plagiarism detection. In this research, we focus on analyzing the tree structure representation and the studies of Chow et al. [2, 11-13]. In their works, a document is hierarchically organized into layers. In this way, the tree can capture not only syntactic but also semantic information of a document. While the root represents global information or the main topics of the document, other layers capture local information or sub-topics of the main topics, and the leaf level can be used to perform detailed comparison. Their proposed models have significantly improved the accuracy of DC, DR and PD. However, the features used to represent each layer are still derived from the term-based Vocabulary, and hence the systems show some limitations when performing intelligent plagiarism detection. Therefore, in our study, we modify the term-based tree structure representation, in particular the features used to represent each layer, in order to perform the detection of a specific type of high-level plagiarism: Paraphrasing. The modified structure representation is referred to as the Concept-based Tree Structure Representation.

1.3. Fields of Thesis

Document Representation; Information Retrieval; Plagiarism Detection; Text Mining.

1.4. Research Question

This thesis presents the study and development of a new mechanism to detect one particular tactic of sophisticated plagiarism: Paraphrasing. We enhance the original tree structure representation, based solely on word histograms, with an additional feature to capture multi-dimensional semantic information.
We call the additional feature the Concept-based feature. The ultimate aim of our research is to develop a strong structural representation to discover plagiarism by paraphrasing and, potentially, higher-level plagiarism. In our experiments, we focus on examining how the concept-based feature, in combination with the term-based tree structure representation, contributes to the tasks of document organization, document retrieval and paraphrased plagiarism detection. In the modified structural representation, each node of the tree is represented by two derived vectors, one of terms and one of concepts. To overcome the "Curse of Dimensionality" due to the lengths of these vectors, we further apply Principal Component Analysis (PCA) [14], a well-known technique for dimensionality reduction. For the number of tree layers, we choose the 3-layer document-paragraph-sentence model to represent a document. We also consider the task of document organization by applying the Self-Organizing Map (SOM) clustering technique [15]. In document retrieval, we only compare documents in the same areas, since comparing documents on different topics is regarded as serving no purpose [16], e.g. we compare a CIS paper against a collection of CIS papers rather than against biology papers. To generate the concept-based feature, we use the external background knowledge source WordNet to first generate the concept-based Vocabulary. The concept-based feature is then derived from this Vocabulary and used, together with the term-based feature, to represent a document.

1.5. Contributions

In this thesis, we introduce the C-PCA-SOM 3-stage prototype for high-level Plagiarism Detection, comprising: Stage 1 – Document Representation & Indexing; Stage 2 – Source Detection & Retrieval; and Stage 3 – Detail Plagiarism Analysis.
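The PCA compression mentioned above can be sketched with a standard SVD-based implementation. This is a generic sketch under our own naming, not the thesis code; the thesis applies the same reduction to each tree layer's term and concept vectors.

```python
import numpy as np

def pca_reduce(X, k):
    """Project rows of X onto their top-k principal components.

    X: (n_samples, n_features) matrix of feature vectors.
    Returns an (n_samples, k) matrix of reduced coordinates.
    """
    Xc = X - X.mean(axis=0)              # center the data
    # SVD of the centered matrix; rows of Vt are the principal axes.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                 # coordinates in the top-k subspace
```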
In addition, owing to the constant processing time achieved in our experiments, the prototype can support real-time applications for Document Representation, Document Clustering, Document Retrieval and, potentially, Paraphrased Plagiarism Detection. Through experiments, it is verified that the introduction of the additional Concept-based feature can improve the performance of Source Detection and Retrieval compared with models based solely on the Term-based feature. Furthermore, it is shown that the enhanced tree structure representation not only captures syntactic information, as the original scheme does, but also discovers hidden semantic information of a document. By capturing multi-dimensional information of a document, the task of Document Clustering is improved in that more semantically related documents can be grouped into meaningful clusters even though they are expressed differently. As a result, Document Retrieval also benefits, since more documents on the same topics can be detected and retrieved.

Chapter 2 – Literature Review

This chapter provides a comprehensive overview of the literature on different types of plagiarism in section 2.1. In section 2.2, a variety of document representation schemes are discussed, including non-structural or flat-feature-based representation and structural representation. Existing plagiarism detection techniques are outlined in section 2.3. Finally, the limitations of these PD techniques and representation schemes are discussed for potential improvements and further study in section 2.4.

2.1. Plagiarism Taxonomy

When humans began producing papers as a part of intellectual documentation, plagiarism also came into existence. Documentation and plagiarism exist in parallel but are two entirely different sides of the same coin. While one contributes to the knowledge body of human society, the other causes serious damage to intellectual property.
Recognizing this, the ethical community has developed many techniques to fight plagiarism. However, the battle against this phenomenon is a lifelong one, since plagiarism has also evolved and become more sophisticated. Therefore, to efficiently engage such a devious enemy, it is necessary to have a mapping scheme to identify and classify different types of plagiarism into meaningful categories. Many studies have been conducted to perform this task [1, 3, 17]. Lukashenko et al. [1] point out different types of plagiarism activities, including:

- Copy-paste plagiarism (word-to-word copying).
- Paraphrasing (using synonyms/phrases to express the same content).
- Translated plagiarism (copying content expressed in another language).
- Artistic plagiarism (expressing plagiarized works in different formats such as images or text).
- Idea plagiarism (extracting and using others' ideas).
- Code plagiarism (copying others' programming code).
- No proper use of quotation marks.
- Misinformation of references.

More precisely, Alzahrani et al. [3] use a taxonomy and classify different types of plagiarism into two main categories: literal and intelligent plagiarism (Fig. 1). In the former, plagiarists make an exact copy or only a few changes to original sources. Thus, this type of plagiarism can be detected easily. The latter case is much more difficult to detect, as plagiarists try to hide their intentions by using intelligent ways to change original sources. These tactics include text manipulation, translation and idea adoption. In text manipulation, plagiarists try to change the appearance of the text while keeping its semantic meaning or idea. Paraphrasing is one commonly performed tactic of text manipulation. It transforms text appearance by using synonyms, hyponyms, hypernyms or equivalent phrases. In this research, our main focus is to detect this type of intelligent plagiarism.
Plagiarism based on Translation is also known as cross-lingual plagiarism. Offenders can use translation software to copy text written in other languages and bypass monolingual systems. Finally, Alzahrani et al. consider idea adoption to be the most serious and dangerous type of plagiarism, since stealing ideas from others' works without proper referencing is the most disrespectful action toward their authors and intellectual property. This type of plagiarism is also the hardest to detect, because plagiarized text might not carry any syntactic information similar to the original sources, and because the plagiarized ideas can be extracted from multiple parts of the original documents.

2.2. Document Representation

Since a vast number of documents are available online and many more are uploaded every day, the demand for efficiently organizing and indexing them for fast retrieval continually poses challenges for the research community. Many schemes have been developed and improved to represent a document effectively. Instead of using a whole document as a query, these representations can be applied to many text processing tasks such as classification, clustering, document retrieval and plagiarism detection. This section discusses two main strategies of document representation as well as available techniques to detect plagiarism.

2.2.1. Flat Feature Representation

One of the most popular and widely used models is the Vector Space Model (VSM) [18]. In this model, a weighted vector of term frequency and document frequency is calculated based on a pre-constructed Vocabulary. This Vocabulary is a list of the most frequent words/terms, derived beforehand from a given training corpus. The scheme used to perform term weighting is TF-IDF. Term frequency (TF) counts the number of occurrences of each term in one specific document, while document frequency (from which the inverse document frequency, IDF, is computed) counts the number of documents containing a specific term.
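The TF-IDF weighting just described can be sketched as follows. This is one common variant; the exact weighting formula in the cited systems may differ.

```python
import math

def tfidf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    TF counts occurrences of a term in a document; IDF penalizes
    terms that occur in many documents. Returns one dict of
    term -> weight per document.
    """
    n = len(docs)
    df = {}                              # document frequency per term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weighted = []
    for doc in docs:
        tf = {}
        for term in doc:
            tf[term] = tf.get(term, 0) + 1
        weighted.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return weighted
```

A term occurring in every document gets IDF log(n/n) = 0 and so carries no weight, while rare terms are emphasized.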
In the VSM model, a vector of word histograms is constructed for each document. All vectors together form the term-document matrix (Fig. 2). The similarity between two documents is calculated by applying the Cosine distance function to their vectors [19]. One drawback of the VSM model is that the vectors used to represent documents are usually lengthy, due to the size of the Vocabulary, and hence not scalable for large datasets.

Figure 1 - Taxonomy of Plagiarism

Figure 2 - Term-Document Matrix

To overcome the Curse of Dimensionality in VSM, Latent Semantic Indexing (LSI) [20] was proposed to project lengthy vectors onto a lower number of dimensions with semantic information preserved, by mapping the space spanned by these vectors to a lower-dimensional subspace. The mapping is based on the Singular Value Decomposition (SVD) of the VSM-based term-document matrix (Fig. 3). Similarly, another approach for dimensionality reduction and feature compression is the Self-Organizing Map (SOM) [15]. In a SOM, similar documents are organized close to each other. Instead of being represented by a word histogram vector, each document is represented by its winner neuron or Best Matching Unit on the map. Applications of SOM such as VSM-SOM [21], WEBSOM [22], LSISOM [23] and those in [21, 24] have shown considerable speed-ups in document clustering and retrieval. SOM can be combined not only with flat feature representation but also with the structural representation discussed in the next section.

Figure 3 - Singular Value Decomposition of term-document matrix A

Considering that relying on a "bag of words" alone might not be enough, many further studies propose adding features alongside term-based flat features to enhance document representation. In [25], Xue et al. propose using distributional features in combination with the traditional term frequency to improve the task of text categorization.
The proposed features include the compactness of the appearances of a word and the position of its first appearance. Based on these features, they assign a specific weight to a word according to its compactness and position, the reasoning being that authors are likely to mention the main content in the earlier parts of a document. Hence, words appearing in these parts are considered more important and assigned higher weights. Similarly, another approach to "enrich" document representation is to utilize external background knowledge such as WordNet, Wikipedia or thesaurus dictionaries. In [26], Hu et al. use Wikipedia to derive two additional features, concept-based and category-based, on top of the conventional term-based feature. Their experiments have shown significant improvements in the task of document clustering. Similar applications of external background knowledge can be found in [26-31]. In our study, we apply WordNet instead of Wikipedia to generate the concept-based feature and use it to enhance document representation.

2.2.2. Structural Representation

By using only word histogram vectors to represent documents, flat feature representation ignores the contextual usage and relationships of terms throughout a document [2] and hence loses semantic information. In addition, two documents might be contextually different even though they contain the same term distribution. Recognizing this serious limitation, many studies have been carried out to develop new ways of representing a document which can capture not only syntactic but also semantic information. These new schemes are referred to as structural representations.

To capture semantic information, Schenker et al. [9] propose using a directed graph model to represent documents. The graph structure consists of two components: Nodes and Edges. Nodes (Vertices) are terms appearing in a document, weighted by the number of appearances.
Edges link nodes together and indicate the relationships between terms. An edge is only formed between two terms that appear immediately next to each other in a sentence. Chow et al. [10] also study the directed graph and further develop another type of graph, the undirected graph. Their directed model likewise considers the order of term occurrence in a sentence, while in the additional undirected model the connection of terms is considered without taking the usage order of terms into account. They further perform Principal Component Analysis (PCA) for dimensionality reduction and SOM for document organization. Their experiments show significant improvements compared with other single-feature-based approaches. Another group of models which can capture both syntactic and semantic information of a document is the group of tree-based representation models. The earliest study of the tree structure representation was conducted by Si et al. [16]. Importantly, they realize that it is unnecessary to compare documents addressing different subjects. Their model organizes a document according to its structure and hence forms a tree, i.e. a document may contain many sections, a section may contain many subsections, a subsection again may contain many sub-subsections, etc. This mechanism significantly improves the efficiency of document comparison, since lower-level comparisons can be terminated if higher levels are different. However, the lengthy vectors at each level and the potentially high number of layers make their model unscalable for large corpora. The most recent works of Chow et al. [2, 11-13] use a fixed number of layers (2 or 3), reduced-size term-based Vocabularies and PCA compression to make the tree structure representation applicable to large datasets. To minimize the time complexity of document retrieval, they further apply SOM to organize documents according to their similarities [2, 11]. Fig.
4 denotes the 3-layer document-paragraph-sentence tree representation in [12], which is also the model we focus on improving in this study. Alternative numbers of layers and representation units for layers can be found in [2, 11, 13].

Figure 4 - 3-layer document-paragraphs-sentences tree representation

2.3. Plagiarism Detection Techniques

According to Lukashenko et al. [1], the task of fighting plagiarism is classified into plagiarism prevention and plagiarism detection. The main difference between the two classes is that detection requires less time to implement but can only achieve a short-term positive effect. On the other hand, although prevention methods are time-consuming to develop and deploy, they have a long-term effect and hence are considered a more significant approach to effectively fight plagiarism. Prevention, unfortunately, is a global issue and cannot be solved by just one institution. Therefore, most existing techniques fall into the detection category, and much research has been conducted to develop more powerful plagiarism detection techniques. Alzahrani et al. [3] categorize plagiarism detection techniques into two broad trends: intrinsic and extrinsic plagiarism detection. Intrinsic PD techniques analyze a suspicious document locally, i.e. without collecting a set of candidate documents for comparison [32, 33]. These approaches employ analyses of authors' writing styles. They are based on the hypothesis that each writer has a unique writing style, so a change in writing style signals a potential plagiarism case. Features used for this type of PD are Stylometric Features based on text statistics, syntactic features, POS features, closed-class word sets and structural features. On the other hand, extrinsic PD techniques compare query documents against a set of source documents. Most existing PD systems deploy these extrinsic techniques.
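The stylometric features used by intrinsic PD can be illustrated with a minimal sketch. The feature choices below are our own toy examples; real intrinsic PD systems use far richer feature sets.

```python
def stylometric_features(text):
    """Compute a few simple writing-style statistics for a passage.

    A toy illustration of stylometric analysis: a sharp change in
    such statistics between passages of the same document may signal
    a change of author, and hence potential plagiarism.
    """
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    words = text.split()
    return {
        "avg_sentence_len": len(words) / len(sentences),          # words per sentence
        "avg_word_len": sum(len(w) for w in words) / len(words),  # characters per word
        "type_token_ratio": len({w.lower() for w in words}) / len(words),  # vocabulary richness
    }
```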
There are several common steps in performing extrinsic PD. Firstly, Document Indexing stores all registered documents in databases for later retrieval. Secondly, Document Retrieval retrieves the most relevant candidates that might have been plagiarized by a given query document. Eventually, Exhaustive Analysis is carried out between candidates and query documents to locate plagiarized parts. In extrinsic PD techniques, the majority of exhaustive analysis methods partition all documents into blocks (n-grams or chunks) [4, 34-36]. Units in each block can be characters, words, sentences, paragraphs, etc. These blocks are then hashed and registered against a hash table. To perform PD, suspicious documents are also divided into small blocks and looked up in the hash table. Eventually, similar blocks are retrieved for detailed comparison. COPS [4] and SCAM [5] are two typical implementations of these approaches. According to [2, 16], these methods are inapplicable to large corpora due to the ever-increasing number of documents. Furthermore, they can be bypassed easily by making some changes at sentence level. Note that the methods mentioned above apply flat features only and ignore the contextual information of how words/terms are used throughout a document. Two documents with the same term distribution might be contextually different. To tackle this problem, PD systems that utilize structural representation were then proposed [2, 12, 16]. These approaches significantly improve the performance of extrinsic plagiarism detection. Since documents are hierarchically organized into multiple levels, the comparison can be terminated at any level where the amount of dissimilarity between query and candidate documents exceeds a user-defined threshold. Experiments with these structure-based models have shown better performance compared with flat-feature-based systems. 2.4.
Limitations

Most existing PD systems are implemented based on flat feature representation. As mentioned in 2.1, they cannot capture the contextual usage of words/terms throughout a document and can be bypassed easily with minor modifications performed on original sources. Structural-representation-based PD systems have made significant improvements in capturing this rich textual information. By organizing documents hierarchically, structural models can capture not only syntactic but also semantic information of a document. Recent studies have shown some important contributions of structural representation in document organization tasks such as classification and clustering [2, 11, 13]. Consequently, the task of plagiarism detection is also improved, in terms of both reduced time complexity and higher detection accuracy. The most relevant documents are first retrieved to narrow the processing scope, and further comparisons are terminated at levels where similarity is low. Although it has been proved that structural representation can be applied to detect literal plagiarism efficiently, structural-representation-based PD systems still show some limitations in detecting intelligent plagiarism. For example, plagiarists can perform paraphrasing and replace the detected words/terms with their synonyms, hyponyms or hypernyms to bypass the detection of these systems. It is noticed that the problems arise from the term-based Vocabulary. In this type of Vocabulary, terms with similar meanings are treated as unrelated. For instance, large, huge and enormous carry similar meanings and are exchangeable in usage. However, in this type of Vocabulary, they are considered as different terms. Therefore, any feature derived from this Vocabulary is not strong enough to detect sophisticated plagiarism. By replacing words/terms of an original sentence with semantically similar words/terms, a plagiarized sentence will be treated as a different sentence.
In order to discover similar sentences even though they are expressed in different ways, our study exploits the external background knowledge WordNet to construct the concept-based Vocabulary. The additional Vocabulary is built by grouping words with similar meanings in the term-based Vocabulary into one concept. After that, we use this Vocabulary to generate one more feature, called the Concept-based feature, to enrich the representation of a document.

Chapter 3 – Methodology

This section outlines the main techniques we apply to develop the prototype for paraphrasing detection. We call it the 3-stage prototype, consisting of: Stage 1 – Document Representation & Indexing, Stage 2 – Source Detection & Retrieval and Stage 3 – Detail Plagiarism Analysis. The content of Stage 1 is discussed in Section 3.1, consisting of the construction of the two types of Vocabulary, the extraction of the corresponding 2 types of feature to represent a document and, finally, the application of SOM to organize documents into meaningful clusters. Section 3.2 gives the details of Stage 2 on how to use the data stored in Stage 1 to perform fast original source identification and retrieval. Finally, Stage 3 of the prototype performs detailed plagiarism analysis based on the candidate documents retrieved in Stage 2. The mechanism of the detailed analysis of Stage 3 is outlined in Section 3.3.

3.1. Document Representation and Indexing

3.1.1. Term based Vocabulary Construction

The construction of the term-based Vocabulary is straightforward. Firstly, we perform term extraction from a training corpus. After that, we further apply Word Stemming to transform terms to their simple forms. For example, words such as "computes", "computing" and "computed" are all considered as "compute".
Because the original Porter stemming algorithm only creates "stems" instead of words in their simple forms, making it impossible to look them up in an English dictionary or thesaurus, we have modified the Porter algorithm. The modified version tries to stop at the stage where words are in or near their simple forms. As a result, it is possible to search for these words' synonyms, hypernyms and hyponyms via, for example, a thesaurus. Fig. 5 depicts the difference between the original and modified Porter stemmers.

Figure 5 - Comparison of the original & modified Porter stemmers

After applying stemming, we subsequently perform stop word removal in order to remove insignificant words such as "a", "the", "are", etc. Finally, we use the TF-IDF (Term Frequency – Inverse Document Frequency) weighting scheme to weight the significance or importance of each word throughout the corpus. The weights of all terms are then ranked from highest to lowest (from most to least significant). Similar to Chow et al. [2, 12], we select the first n1 terms to form the Vocabulary V1t used for the Document and Paragraph levels of the tree structure representation. The first n2 terms are selected to form the Vocabulary V2t, which is used for the Sentence level. In addition, n2 is much larger than n1. The data structure of the two term-based Vocabularies is denoted in Fig. 6.

Figure 6 - Data structure of Term-based Vocabulary

3.1.2. Concept based Vocabulary Construction

In order to construct the additional concept-based Vocabulary, we exploit WordNet – the lexical database for the English language [37] – as background knowledge. WordNet, whose development was begun by Miller in 1985, has been applied in many text-related processing tasks such as document clustering [29], document retrieval [28, 30] and word-sense disambiguation. In WordNet, nouns, verbs, adjectives and adverbs are distinguished and organized into meaningful sets of synonyms.
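The TF-IDF ranking and top-n selection step described above can be sketched as follows. How a term's weight is aggregated across the corpus is not specified in the text; taking the maximum TF-IDF a term reaches in any document is an assumption of this sketch, as is the toy corpus.

```python
import math
from collections import Counter

def build_vocabulary(corpus_tokens, n_top):
    """Rank terms by TF-IDF and keep the top n_top as the Vocabulary.

    corpus_tokens: list of documents, each a list of stemmed, stop-word-free tokens.
    Aggregation across documents (max TF-IDF) is an assumption for illustration.
    """
    df = Counter()                          # document frequency of each term
    for doc in corpus_tokens:
        df.update(set(doc))
    n_docs = len(corpus_tokens)
    best = {}                               # highest TF-IDF a term reaches anywhere
    for doc in corpus_tokens:
        tf = Counter(doc)
        for term, count in tf.items():
            w = (count / len(doc)) * math.log(n_docs / df[term])
            best[term] = max(best.get(term, 0.0), w)
    ranked = sorted(best, key=best.get, reverse=True)
    return ranked[:n_top]

docs = [["apple", "banana", "apple"], ["banana", "cherry"], ["cherry", "apple", "date"]]
print(build_vocabulary(docs, 2))   # → ['date', 'apple']
```

In the prototype, the same ranking would be cut at n1 to obtain V1t and at the larger n2 to obtain V2t.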
In this section, we outline the mechanism to utilize WordNet for the construction of the concept-based Vocabulary. Firstly, for each term T in the term-based Vocabulary, we look up its synonyms, hypernyms and hyponyms in WordNet by using the synonym-hypernym-hyponym relationships of the ontology. The result of this step is a "bag" of terms similar to T (Fig. 7). After that, we check for these terms' appearances in the term-based Vocabulary. Any term which does not appear in the term-based Vocabulary is removed, and the remaining terms form one concept. This process is performed repeatedly for the whole term-based Vocabulary to obtain the concept-based Vocabulary. Fig. 8 denotes the data structure of the additional Vocabulary.

Figure 7 - Example of looking up synonyms, hypernyms and hyponyms

Figure 8 - Data structure for the concept-based Vocabulary

Similar to the construction of Vocabularies V1t and V2t, we select the first n1 concepts to form the Vocabulary V1c used for the Document and Paragraph levels and the first n2 concepts to form the Vocabulary V2c used for the Sentence level (n2 is also much larger than n1).

3.1.3. Document Representation

After the construction of the two types of Vocabulary, we store them to hard drive and are then ready to compute each document's representation. In our research, we choose the Document-Paragraph-Sentence 3-layer tree representation in [12]. Following Zhang et al., each document is firstly partitioned into paragraphs and each paragraph is similarly partitioned into sentences. This process builds the 3-layer tree representation for each document. The root node represents the whole document, the second layer captures information about the paragraphs of the document, and each paragraph has its sentences situated at the corresponding leaf nodes. The modification of the original tree structure is carried out in the feature construction steps for each layer.
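The concept-grouping mechanism can be sketched as below. The `related_terms` dict stands in for the WordNet synonym/hypernym/hyponym lookup (which in practice could be done via nltk.corpus.wordnet); its contents are a hypothetical lookup result. Whether a term may belong to more than one concept is not specified in the text, so this sketch assigns each term to at most one concept.

```python
def build_concepts(term_vocab, related_terms):
    """Group each term with its in-vocabulary related words into one concept.

    related_terms: stand-in for a WordNet synonym/hypernym/hyponym lookup;
    here a plain dict, for illustration only.
    """
    vocab = set(term_vocab)
    concepts = []
    assigned = set()                     # assumption: each term joins one concept
    for term in term_vocab:
        if term in assigned:
            continue
        # keep only related words that actually occur in the term vocabulary
        bag = {term} | (set(related_terms.get(term, ())) & vocab)
        concepts.append(sorted(bag))
        assigned |= bag
    return concepts

thesaurus = {"large": ["huge", "enormous", "gigantic"]}   # hypothetical lookup result
print(build_concepts(["large", "huge", "small"], thesaurus))
# → [['huge', 'large'], ['small']]
```

Note how "gigantic" and "enormous" are dropped because they are absent from the term-based Vocabulary, exactly as the filtering step above prescribes.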
For all layers, term extraction, stemming and stop word removal are still applied to extract only significant terms. For the top and second layers, term-based vectors are derived as usual by checking and weighting the terms that appear in the term-based Vocabulary V1t. At the same time, we map all of these terms to their concepts based on the Vocabulary V1c. The weight of a concept is the sum of the weights of all its elements. For the bottom layer, instead of using word histograms, we use an "appearance indices of terms" vector to indicate the absence/presence of the corresponding terms in a sentence, similar to [12]. In addition, an "appearance indices of concepts" vector is utilized to indicate the absence/presence of the corresponding concepts in the sentence. Up to this stage, each node of the tree is represented by 2 features: the term-based feature and the additional concept-based feature. To overcome the "Curse of Dimensionality", we use Principal Component Analysis (PCA) to compress the features at the Document and Paragraph levels. PCA is one of the well-known tools for feature compression and high-dimension reduction. We use the same training corpus as the one used for constructing the Vocabularies to calculate two PCA rotation matrices independently for the term and concept features. We also store the matrices to hard disk in order to apply them to query documents later in the stage of Source Detection and Retrieval. The PCA-projected features are calculated as below:

F_PCA = F_k . R_(k x l)   (1)

where F_k = {f_1, f_2, ..., f_k} is the normalized term- or concept-based histogram with k dimensions, R_(k x l) is the k x l PCA rotation matrix and F_PCA is the resulting PCA-compressed feature with reduced dimension l (l is much smaller than k).
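Formula (1) can be illustrated with a small NumPy sketch; the eigen-decomposition route to the rotation matrix and the dimensions used here are illustrative assumptions, not the exact procedure of the prototype.

```python
import numpy as np

def pca_rotation(features, l):
    """Compute a k x l PCA rotation matrix from training feature vectors (rows)."""
    X = features - features.mean(axis=0)        # centre the training data
    cov = np.cov(X, rowvar=False)               # k x k covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending eigenvalues
    return eigvecs[:, ::-1][:, :l]              # keep the top-l principal directions

def project(feature, rotation):
    """Formula (1): F_PCA = F_k . R_(k x l)."""
    return feature @ rotation

rng = np.random.default_rng(0)
train = rng.normal(size=(100, 8))               # 100 training vectors, k = 8
R = pca_rotation(train, l=3)                    # stored to disk in the prototype
compressed = project(train[0], R)
print(compressed.shape)                         # → (3,)
```

In the prototype two such rotation matrices are fitted independently, one on the term histograms and one on the concept histograms.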
Finally, the data of all documents' trees is stored for later similarity calculation between suspicious and original documents, paragraphs or sentences, for both Source Detection & Retrieval and Detail Plagiarism Analysis. The concept-based tree structure representation of a document is illustrated in Fig. 9.

Figure 9 - Concept based Document Tree Representation

3.1.4. Document Indexing

According to Si et al. [16], it is unnecessary to compare documents mentioning different topics. Therefore, document organization is crucial to avoid redundant comparisons and minimize processing time as well as computational complexity. For this reason, we apply a clustering technique to organize similar documents into the same clusters. The chosen clustering method is the Self Organizing Map (SOM), due to its flexibility and time efficiency. All documents in an experiment dataset have their trees organized on the map. We construct 2 SOMs, for the root and paragraph levels, in the same manner as [2]. Initially, the SOM of the paragraph level is built by mapping all paragraphs' PCA-compressed term- and concept-based features of all documents onto the map. The results of the 2nd-level SOM are then used as part of the input for the root-level SOM. For the root-level SOM, the features of the root of each document's tree, combined with the resulting winner neurons (also known as Best Matching Units - BMUs) of the corresponding child paragraphs, form the input for the top map. They are then mapped to their nearest BMUs on this root SOM. The mapping process is repeated a number of times in order for all similar documents to converge on the two maps. Eventually, the data of the SOMs is stored to perform fast source detection and retrieval in Stage 2. Fig. 10 illustrates how a document tree is organized onto the document and paragraph SOMs.

Figure 10 - 2 level SOMs for document-paragraph-sentence document tree

3.2.
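The repeated SOM mapping can be sketched minimally as follows. A full SOM also updates a neighbourhood around each winner with a decaying radius; that is omitted here for brevity, and the map size and iteration count are illustrative assumptions.

```python
import numpy as np

def train_som(data, rows, cols, iterations, seed=0):
    """Minimal SOM sketch: pull only the best-matching unit toward each sample.

    (A full SOM would also update the BMU's neighbours; omitted for brevity.)
    """
    rng = np.random.default_rng(seed)
    weights = rng.normal(size=(rows * cols, data.shape[1]))
    for t in range(iterations):
        lr = 0.5 * (1.0 - t / iterations)       # decaying learning rate
        for x in data:
            bmu = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
            weights[bmu] += lr * (x - weights[bmu])
    return weights

def best_matching_unit(weights, x):
    """Index of the neuron closest to feature vector x."""
    return int(np.argmin(np.linalg.norm(weights - x, axis=1)))

data = np.vstack([np.zeros((5, 4)), np.full((5, 4), 10.0)])   # two toy clusters
som = train_som(data, rows=2, cols=2, iterations=20)
print(best_matching_unit(som, np.zeros(4)))
```

In the prototype, the stored neuron weights are what makes Stage 2 fast: a query only has to be compared against the map, not against every corpus document.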
Source Detection and Retrieval

Stage 2 of the prototype detects and retrieves source documents, or relevant candidates, given suspicious documents. For each query document, in the same way as when constructing the tree representations of the corpus documents in Stage 1, we firstly use the stored term- and concept-based Vocabularies to build the query's tree representation. Secondly, we load the stored PCA projection matrices and perform feature compression on the query tree representation. After that, we take the root node and find its Best Matching Unit on the document-level SOM. Subsequently, n candidate documents associated with the BMU are retrieved. If the number of documents related to the BMU is less than n, we retrieve the remaining documents from the BMU's nearest neighbors, which contain the documents most similar to those in the BMU. For the n candidate documents, we compute the summed Cosine distance of the term- and concept-based PCA vectors between the query document and these candidates. The formula to calculate the summed Cosine distance, or overall similarity, is defined as:

D(q, c) = w1 . d(F1q, F1c) + w2 . d(F2q, F2c)   (2)

where w1 + w2 = 1 and
q, c: query and candidate documents
F1q, F1c: Term-based PCA-projected features
F2q, F2c: Concept-based PCA-projected features
d: Cosine distance function

The overall similarity is the sum of the individual similarities of the different types of feature. w1 and w2 are the weights used to emphasize the importance of the different features to the overall similarity. In experiments, we assign different weights to each feature to evaluate the degree of contribution of different features to the performance of source detection and retrieval. After the calculation of the summed Cosine distances, we rank them in ascending order and choose the t documents that have distances lower than the user-defined similarity threshold D_thres for further analysis.
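Formula (2) can be computed directly as follows; the vectors in the usage example are toy values.

```python
import numpy as np

def cosine_distance(a, b):
    """d(a, b) = 1 - cos(a, b); 0 for identical directions, 1 for orthogonal ones."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def summed_distance(q_term, c_term, q_concept, c_concept, w1=0.5, w2=0.5):
    """Formula (2): D(q, c) = w1 * d(F1q, F1c) + w2 * d(F2q, F2c), w1 + w2 = 1."""
    return w1 * cosine_distance(q_term, c_term) + w2 * cosine_distance(q_concept, c_concept)

v = np.array([1.0, 0.0])
u = np.array([0.0, 1.0])
print(summed_distance(v, v, u, u))   # identical features → 0.0
print(summed_distance(v, u, v, u))   # orthogonal features → 1.0
```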
The threshold D_thres is calculated as below:

D_thres = D_min + ε(D_max − D_min)   (3)

where ε ∈ [0, 1]. It is noted that ε = 0 is equivalent to single source retrieval, i.e. only the most similar document will be retrieved.

3.3. Detail Plagiarism Analysis

The third stage of the prototype detects local similarity, i.e. identifies the candidate paragraphs that are most similar to each suspicious paragraph of the query document. By doing so, we can avoid exhaustive sentence comparison for unrelated paragraphs and speed up the detection process. Sentence comparison is then carried out between each sentence of the suspicious paragraph and the candidate paragraphs to locate potential plagiarism cases. These cases are summarized and reported to the user for human assessment.

3.3.1. Paragraph Level Plagiarism Analysis

For each suspicious paragraph of the query document, we similarly find its BMU on the paragraph-level SOM. However, we only retrieve paragraphs that belong to the t candidates detected in Stage 2. We do not retrieve other paragraphs even though they are also associated with the BMU, because their parent nodes are different from the suspicious paragraph's parent node (i.e., the documents mention different topics). After retrieving all candidate paragraphs, we again calculate the summed Cosine distances between these paragraphs and the corresponding suspicious paragraph. Next, we rank the distances, also in ascending order, and select the t' paragraphs that have distances lower than the similarity threshold D'_thres to further perform the exhaustive sentence-level plagiarism analysis. The threshold D'_thres is as follows:

D'_thres = D'_min + ε'(D'_max − D'_min)   (4)

where ε' ∈ [0, 1] and ε' = 0 is for single plagiarized paragraph detection, i.e. only the most similar paragraph will be retrieved.

3.3.2.
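The threshold selection of formulas (3) and (4) can be sketched as follows. Note one assumption: with ε = 0 the comparison must be non-strict (<=) for the single most similar document to survive, which this sketch takes as the intended reading of "lower than the threshold".

```python
def select_candidates(distances, epsilon=0.0):
    """Formulas (3)/(4): keep items with D <= D_min + epsilon * (D_max - D_min).

    distances: dict mapping candidate id to summed Cosine distance.
    epsilon = 0 keeps only the most similar candidate (single source retrieval).
    """
    d_min, d_max = min(distances.values()), max(distances.values())
    threshold = d_min + epsilon * (d_max - d_min)
    return sorted(doc for doc, d in distances.items() if d <= threshold)

d = {"a": 0.1, "b": 0.5, "c": 0.9}
print(select_candidates(d, epsilon=0.0))   # → ['a']
print(select_candidates(d, epsilon=0.5))   # → ['a', 'b']
```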
Sentence Level Plagiarism Analysis

After retrieving the most relevant paragraphs for each suspicious paragraph, for each pair of original and suspicious paragraphs, we perform the exhaustive comparison of all their sentences using the corresponding leaf nodes of the tree representations. Because we use appearance indices of terms and concepts rather than histograms as features for the bottom layer, the calculation of sentence similarity is slightly different. We define the similarity of 2 sentences as the amount of overlap between their terms and concepts. The sentence similarity is calculated by the following formula:

Overlap(q, c) = w1 . |S_q,t ∩ S_c,t| / |S_q,t ∪ S_c,t| + w2 . |S_q,c ∩ S_c,c| / |S_q,c ∪ S_c,c|   (5)

where w1 + w2 = 1 and
S_q,t, S_c,t: sets of terms appearing in the query and candidate sentences
S_q,c, S_c,c: sets of concepts appearing in the query and candidate sentences

The overall overlap between a query sentence and a candidate sentence is the sum of the individual overlaps of the different types of feature. If the summed overlap is larger than the overlap threshold, Overlap(q, c) > α ∈ [0.5, 1], then this pair of sentences is considered a plagiarism case. In addition, the user can flexibly change the overlap threshold to detect more or fewer plagiarism cases. For example, if α = 0.8, any pair of sentences that has a degree of overlap of more than 80% is considered a plagiarism case. The exhaustive process is performed repeatedly for the remaining pairs of paragraphs. Finally, all plagiarism cases are presented to the user for human review.

Chapter 4 – Experiments

This section outlines the experiments we carry out to test the performance of our prototype. Firstly, Section 4.1 introduces the dataset used for training and testing plagiarism detection as well as the configuration of the experiment workstation.
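Formula (5) is a weighted Jaccard overlap over the term and concept appearance sets, and can be sketched as follows; the sentence term and concept sets in the usage example are toy values.

```python
def sentence_overlap(q_terms, c_terms, q_concepts, c_concepts, w1=0.5, w2=0.5):
    """Formula (5): weighted Jaccard overlap of term and concept appearance sets."""
    def jaccard(a, b):
        return len(a & b) / len(a | b) if (a | b) else 0.0
    return w1 * jaccard(q_terms, c_terms) + w2 * jaccard(q_concepts, c_concepts)

# toy sentences: 2 of 6 distinct terms shared, 3 of 4 distinct concepts shared
score = sentence_overlap({"quick", "brown", "fox", "jump"},
                         {"quick", "brown", "cat", "run"},
                         {"c1", "c2", "c3"},
                         {"c1", "c2", "c3", "c4"})
print(score >= 0.5)   # → True: flagged under the minimum threshold alpha = 0.5
```

The example illustrates the point of the concept-based feature: the paraphrased sentence shares few literal terms with the original, but its concept overlap pushes the summed score over the threshold.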
Up to the point of writing this thesis, we have conducted experiments on the performance of the Source Detection and Retrieval stage of our system. Further experiments to test the full functionality of our prototype are addressed in Chapter 5 as future work. For the experiments on Source Detection and Retrieval, we compare the results of our prototype (C-PCA-SOM) against a variety of systems, including the original tree-based retrieval (PCA-SOM) in [2] and the traditional VSM model. As the candidate for latent semantic compression, many previous studies have chosen the LSI model; in our study, we provide the comparison with the PCA model instead. All comparative models are slightly modified to use the same modified Porter stemmer as our model. For the 2 SOM-based models, we only use their top SOM maps for document retrieval and temporarily ignore the contribution of the second-layer maps. Section 4.2 provides the results of Source Detection and Retrieval for Literal Plagiarism and Section 4.3 for Paraphrased Plagiarism. In addition, we carry out empirical tests to study the contribution of different parameters to the accuracy and optimization of our system, such as the weights (w1, w2) or the dimensions of the term- and concept-based PCA features. The details are reported in Section 4.4.

4.1. Experiment Initialization

4.1.1. The Dataset and Workstation Configuration

We use the Webis Crowd Paraphrase Corpus 2011 (Webis-CPC-11) dataset to test our system. This dataset formed part of PAN 2010 – the international competition on plagiarism detection – and is available for download at http://www.webis.de/research/corpora/corpus-pan-pc-10. In detail, there are 7,859 candidate documents in total making up the original corpus. The test set also contains 7,859 suspicious documents corresponding to those in the corpus, i.e. each document in the corpus has exactly one plagiarized document in the test dataset.
The test set consists of 3,792 documents of literal plagiarism (non-paraphrased) cases and 4,067 documents of paraphrased plagiarism cases. For each type of plagiarism, we construct multiple pairs of sub-corpus and sub-test-set with sizes of 50, 100, 200 and 400 randomly selected documents to evaluate our system at different data scales. For each sub-corpus, we perform the processes described in Stage 1 of the 3-stage prototype to first organize its documents, and later use the corresponding sub-test-set for evaluation. The experiments, the computation of the PCA rotation matrices and SOM clustering are conducted on a PC with a 2.2 GHz Core 2 Duo CPU and 2 GB of RAM.

4.1.2. Performance Measures and Parameter Configuration

To provide comparable results between different models, we use Precision and Recall to evaluate their performance. They are computed as follows:

Precision = No. of correctly retrieved documents / No. of retrieved documents   (6)

Recall = No. of correctly retrieved documents / No. of relevant documents in corpus   (7)

Since each original document has exactly one plagiarized document, we only consider whether the first retrieved document is the correct candidate or not. Hence, we set the scaling parameter ε = 0 in formula (3). In this stage, we assume that the contributions of the term- and concept-based features are equivalent and set w1 = w2 = 0.5 in formula (2). The empirical study of these parameters is outlined later in Section 4.4. It is noticed that Precision and Recall are then equal when using the Webis-CPC-11 dataset. Therefore, we use PR to indicate both of them and further add the "No of correct retrieval" measure to indicate the number of correctly detected documents for suspicious documents in the test sets.

4.2. Source Detection and Retrieval for Literal Plagiarism

To begin with, we arbitrarily set the parameters for each sub-corpus as in Table 1.
Our model, the C-PCA-SOM, uses all of these parameters, while the PCA-SOM ignores the concept-based Vocabulary and concept-based PCA feature dimensions. The PCA model only uses the term-based Vocabulary and term-based PCA feature dimensions. The VSM model only uses the term-based Vocabulary to construct its term-document matrix. In addition, we set w1 = w2 = 0.5 as mentioned above to give the term- and concept-based features the same amount of contribution.

Corpus size   V1t size   V1c size   T/C PCA dimensions   SOM size   SOM iterations
50            1500       1000       40/40                6x8        100
100           2500       2000       80/80                7x8        150
200           3500       2500       130/130              8x8        200
400           5000       3500       220/220              8x9        200

Table 1 - Configuration of Parameters for Literal Plagiarism

The results of the different models are reported in Table 2 and Fig. 11. The diagram illustrates the PRs of the different systems in detecting source documents for Literal Plagiarism cases. It is noticed that our model produces competitive results with the other models for the case of single source detection. For the corpus sizes of 100 and 200, our system is even slightly better than the PCA-SOM without the concept-based feature. For the corpus size of 50, all systems generate the same result of 96%. It is observed that PCA and VSM seem to perform better for single source detection cases. Even though retrieval takes more time for PCA and VSM, these models compare each query document with all documents in the corpora, so the possibility of missing the real candidate is unlikely. For SOM-based models such as ours and the PCA-SOM, fast retrieval depends on the results of the earlier clustering process. This is clarified in Section 4.4, where the change of any parameter can affect the accuracy of document clustering and, consequently, document retrieval.
Corpus size   Algorithm    No of correct retrieval   PR
50            C-PCA-SOM    48/50                     0.96
50            PCA-SOM      48/50                     0.96
50            PCA          48/50                     0.96
50            VSM          48/50                     0.96
100           C-PCA-SOM    91/100                    0.91
100           PCA-SOM      88/100                    0.88
100           PCA          91/100                    0.91
100           VSM          91/100                    0.91
200           C-PCA-SOM    182/200                   0.91
200           PCA-SOM      181/200                   0.905
200           PCA          185/200                   0.925
200           VSM          187/200                   0.935
400           C-PCA-SOM    355/400                   0.8875
400           PCA-SOM      355/400                   0.8875
400           PCA          359/400                   0.8975
400           VSM          361/400                   0.9025

Table 2 - Source Detection & Retrieval for Literal Plagiarism

Figure 11 - Performance of Source Detection & Retrieval for Literal Plagiarism

4.3. Source Detection and Retrieval for Paraphrased Plagiarism

In the same manner as the test for Literal Plagiarism, arbitrary parameters are set first. Table 3 denotes the parameter configuration for the specific corpora. The results of Source Detection and Retrieval for Paraphrased Plagiarism cases are reported in Table 4 and Fig. 12. w1, w2 are kept the same as in 4.2. Surprisingly, in the case of Paraphrased Plagiarism, the PCA-SOM model performs better than the C-PCA-SOM in detecting the corresponding candidates. In addition, since only global information is involved in retrieval, the exhaustive VSM and PCA still produce better results in finding the single source document for a suspicious document. This may be because the overall topic of a paraphrased document remains the same as that of the original document. Regarding the different performance of the two SOM-based models, we further investigate the contribution of the concept-based feature to the performance of clustering and later retrieval. At this stage, it can be assumed that the concept-based feature might introduce noise to the clustering process. To clarify this assumption, we try different values of the weights w1 and w2. The results are reported in Section 4.4.
Corpus size   V1t size   V1c size   T/C PCA dimensions   SOM size   SOM iterations
50            1700       1100       45/45                6x8        100
100           2700       2300       90/90                7x8        150
200           3800       3000       140/140              8x8        200
400           5500       4000       240/240              8x9        200

Table 3 - Configuration of Parameters for Paraphrased Plagiarism

Corpus size   Algorithm    No of correct retrieval   PR
50            C-PCA-SOM    44/50                     0.88
50            PCA-SOM      46/50                     0.92
50            PCA          43/50                     0.86
50            VSM          46/50                     0.92
100           C-PCA-SOM    83/100                    0.83
100           PCA-SOM      86/100                    0.86
100           PCA          90/100                    0.9
100           VSM          93/100                    0.93
200           C-PCA-SOM    160/200                   0.8
200           PCA-SOM      167/200                   0.835
200           PCA          174/200                   0.87
200           VSM          183/200                   0.915
400           C-PCA-SOM    274/400                   0.685
400           PCA-SOM      288/400                   0.72
400           PCA          310/400                   0.775
400           VSM          333/400                   0.8325

Table 4 - Source Detection & Retrieval for Paraphrased Plagiarism

Figure 12 - Performance of Source Detection & Retrieval for Paraphrased Plagiarism

4.4. Study of Parameters

This section provides a comprehensive empirical study of the effect of different parameters on the performance of the C-PCA-SOM model, including: the sizes of the term- and concept-based Vocabularies in Sections 4.4.1 and 4.4.2, the dimensions of the term- and concept-based PCA features in Sections 4.4.3 and 4.4.4 and, lastly, the contribution of the weighting parameters w1 and w2 in Section 4.4.5. The experiments are carried out on a compound corpus of size 300 containing both literal and paraphrased plagiarism cases. We randomly choose 150 documents for each type of plagiarism to achieve more representative outcomes. SOM-based document clustering and retrieval are the two processes affected by the change of any parameter and, hence, they are performed again after each change. It is also noticed that the performance can differ slightly between runs with the same set of parameters, as we will see in the following sections.

4.4.1.
Size of Term based Vocabulary V1t

After performing term extraction, word stemming, stop word removal and concept construction, we obtain a full term-based Vocabulary of 9,868 distinct terms and a full concept-based Vocabulary of 6,815 distinct concepts. In the experiments, we choose different sizes of the term-based Vocabulary V1t to test its contribution to the performance of our prototype. In addition, we keep the other parameters fixed as follows: V1c size = 4000, T/C PCA dimensions = 200/200, SOM size = 8 x 9, SOM training iterations = 150, w1 = w2 = 0.5. Table 5 and Fig. 13 illustrate the performance of our system with different choices of V1t size. It can be seen that the size of V1t does not greatly affect the accuracy of Source Detection and Retrieval: Precision/Recall fluctuates between 85.66% and 87.3%. However, an optimum parameter configuration can be achieved in this case at a size of around 6000.

V1t size   No of correct retrieval   PR
3000       261/300                   0.87
4000       259/300                   0.863
5000       257/300                   0.8566
6000       262/300                   0.873
7000       259/300                   0.863
8000       261/300                   0.87

Table 5 - Performance based on different sizes of Term-based Vocabulary

Figure 13 - Performance based on different sizes of Term based Vocabulary

4.4.2. Size of Concept based Vocabulary V1c

In this section, we study the influence of different sizes of the concept-based Vocabulary V1c on the system performance. For the size of the term-based Vocabulary V1t, we choose the optimum value from 4.4.1, V1t = 6000. While the size of V1c is varied, the other parameters are kept the same as in 4.4.1 (V1t size = 6000, T/C PCA dimensions = 200/200, SOM size = 8 x 9, SOM training iterations = 150, w1 = w2 = 0.5). The results are documented in Table 6 and Fig. 14. Similarly, changing the V1c size does not significantly change the accuracy of candidate retrieval.
The highest PR value is 88%, corresponding to a V1c size of 6000 for the corpus size of 300.

V1c size   No of correct retrieval   PR
2000       259/300                   0.863
3000       260/300                   0.866
4000       254/300                   0.846
5000       259/300                   0.863
6000       264/300                   0.88

Table 6 - Performance based on different sizes of Concept based Vocabulary

Figure 14 - Performance based on different sizes of Concept based Vocabulary

4.4.3. Dimensions of Term based PCA feature

For the experiments on different dimensions of the term-based PCA feature, we set the parameters as in Section 4.4.2 except for the concept-based Vocabulary V1c size, for which we choose the value of 6000 that provided the best PR in the earlier section. The summary of the involved parameters is as follows (V1t size = 6000, V1c size = 6000, Concept PCA dimensions = 200, SOM size = 8 x 9, SOM training iterations = 150, w1 = w2 = 0.5). In this study, we can see a clearer trend compared with the studies in 4.4.1 and 4.4.2. It is observed from the results (Table 7 and Fig. 15) that the change in the dimensions of the term-based PCA feature can significantly influence the performance of the C-PCA-SOM model. Specifically, PR increases from 83.6% to 88% as the dimensions rise from 50 to 200. However, PR drops sharply (by more than 60%) from 250 dimensions onward. Based on this study, it is clear that it is unnecessary to use all terms to build the term-based Vocabulary, because doing so might introduce "noisy" features that harm system performance.

Term-based PCA dimensions   No of correct retrieval   PR
50                          251/300                   0.836
100                         260/300                   0.866
150                         264/300                   0.88
200                         264/300                   0.88
250                         83/300                    0.276
300                         87/300                    0.29

Table 7 - Performance based on different dimensions of Term based PCA feature

Figure 15 - Performance based on different dimensions of Term based PCA feature

4.4.4.
Dimensions of Concept based PCA feature

The parameters for the experiments on different dimensions of the concept-based PCA feature are set as follows (V1t size = 6000, V1c size = 6000, Term PCA dimensions = 150, SOM size = 8 x 9, SOM training iterations = 150, w1 = w2 = 0.5). The only modified parameter is the dimensions of the term-based PCA feature, which is set to 150 (providing the best retrieval result in 4.4.3). Table 8 and Fig. 16 summarize the experiment outcomes for various concept-based PCA feature dimensions. Different from the term-based PCA feature, increasing the concept-based PCA feature dimensions can sometimes slightly enhance and sometimes slightly decrease the performance of the C-PCA-SOM system on Source Detection and Retrieval. The diagram shows PR fluctuating from just over 80% to nearly 90%. The highest PR of 89% is achieved with concept-based PCA dimensions of 250.

Concept-based PCA dimensions   No of correct retrieval   PR
50                             242/300                   0.806
100                            260/300                   0.866
150                            255/300                   0.85
200                            253/300                   0.843
250                            267/300                   0.89
300                            256/300                   0.853

Table 8 - Performance based on different dimensions of Concept based PCA feature

Figure 16 - Performance based on different dimensions of Concept based PCA feature

4.4.5. Contribution of the Weights w1 and w2

Finally, to investigate the contributions of the term- and concept-based features, we study different values of w1 and w2 corresponding to the assigned "degree of significance" of Terms and Concepts. For parameter configuration, we only modify the dimensions of the concept-based PCA feature, which is set to 250 as this produced the best result in 4.4.4. The summary of parameters is as follows (V1t size = 6000, V1c size = 6000, T/C PCA dimensions = 150/250, SOM size = 8 x 9, SOM iterations = 150). Table 9 provides the summary of the Source Detection and Retrieval results.
It can be seen that (w1 = 1.0, w2 = 0.0) and (w1 = 0.0, w2 = 1.0) are two special cases. The former is the same as the PCA-SOM model without the Concept-based feature; the latter uses only the Concept-based feature for similarity calculation, i.e. the Term-based feature is ignored. Using either type of feature on its own already achieves satisfactory PR values of 87.6% and 84%, respectively. However, the combination of the Term- and Concept-based features produces better results when the weights w1 and w2 are configured appropriately. For the corpus size of 300, (w1 = 0.4, w2 = 0.6) gives the highest PR of 88.3%. This study demonstrates that the Concept-based feature can be used to improve Document Representation, Document Clustering, Document Retrieval and, potentially, Plagiarism Detection.

In conclusion, even though VSM and PCA can provide better results than PCA-SOM and C-PCA-SOM, they are impractical for large datasets. The two SOM-based models can be applied for real-time DR and PR thanks to their constant processing time. In addition, C-PCA-SOM, with its additional Concept-based feature, can achieve better performance by appropriately balancing the significance of the Term- and Concept-based features.

w1 (Term)                 1.0      0.8      0.6      0.4      0.2      0.0
w2 (Concept)              0.0      0.2      0.4      0.6      0.8      1.0
No of correct retrievals  263/300  262/300  261/300  265/300  254/300  252/300
PR                        0.876    0.873    0.87     0.883    0.846    0.84

Table 9 - Performance based on different values of w1 and w2

Chapter 5 – Conclusion and Future Works

The concept-based tree structure representation and the C-PCA-SOM 3-stage prototype for Paraphrased PD are introduced in this study. By exploiting the WordNet ontology to construct a concept-based feature that enhances document representation, the modified structure can capture not only syntactic but also semantic information of a document.
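The intuition behind the concept-based feature can be shown with a minimal sketch. A hand-made synonym table stands in for WordNet synsets here; the table entries and concept identifiers are hypothetical, not the thesis vocabulary:

```python
# Hypothetical miniature stand-in for WordNet: each entry maps a term to
# the identifier of the synset (concept) it belongs to.
SYNSETS = {
    "car": "auto", "automobile": "auto",
    "quick": "fast", "rapid": "fast",
    "purchase": "buy", "buy": "buy",
}

def concept_features(terms):
    """Map each term to its concept ID; terms without a synset are skipped."""
    return [SYNSETS[t] for t in terms if t in SYNSETS]

# Two paraphrased word sequences share no surface terms at all...
doc_a = ["car", "quick", "purchase"]
doc_b = ["automobile", "rapid", "buy"]

# ...yet their concept-based features coincide exactly.
print(concept_features(doc_a) == concept_features(doc_b))  # True
```

This is why clustering on concept features can place paraphrased documents in the same cluster even when their term vectors barely overlap.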
As a result, the task of Document Clustering can group more semantically related documents, and Document Retrieval can retrieve more relevant candidates on similar topics. We conduct empirical experiments to test our system on Source Detection and Retrieval. The results show that our prototype produces competitive performance compared with other systems. Even though VSM and PCA are better in the case of single-source detection and retrieval, they are impractical for large datasets, whereas our prototype can be applied in real-time applications. Furthermore, with appropriate parameter settings, the performance of the C-PCA-SOM model can be further improved.

In future work, we will focus on experiments on Stage 3 of the prototype to verify the contribution of the Concept-based feature to the task of Paraphrased Plagiarism Detection. In addition, we plan to study and apply another feature, the Category-based feature, to further enhance the concept-modified tree representation. According to [26], the Category-based feature, extracted from another type of background knowledge, Wikipedia, can be used to improve the task of Document Clustering. Therefore, part of our future work is to investigate the Category-based feature and the application of Wikipedia for speeding up DC and DR. Eventually, the automatic configuration of parameters, such as the weights of the different feature types, for optimum performance also needs to be studied in detail.

References

[1] R. Lukashenko, V. Graudina, and J. Grundspenkis, "Computer-based plagiarism detection methods and tools: an overview," in Proceedings of the 2007 International Conference on Computer Systems and Technologies, Bulgaria, 2007, pp. 1-6.
[2] T. W. S. Chow and M. K. M. Rahman, "Multilayer SOM with tree-structured data for efficient document retrieval and plagiarism detection," IEEE Transactions on Neural Networks, vol. 20, pp. 1385-1402, Sept. 2009.
[3] S. M. Alzahrani, N. Salim, and A. Abraham, "Understanding plagiarism linguistic patterns, textual features, and detection methods," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 42, pp. 133-149, March 2012.
[4] S. Brin, J. Davis, and H. Garcia-Molina, "Copy detection mechanisms for digital documents," SIGMOD Rec., vol. 24, pp. 398-409, May 1995.
[5] N. Shivakumar and H. Garcia-Molina, "SCAM: a copy detection mechanism for digital documents," in 2nd International Conference in Theory and Practice of Digital Libraries (DL 1995), Austin, Texas, 1995.
[6] C. Grozea, C. Gehl, and M. Popescu, "ENCOPLOT: pairwise sequence matching in linear time applied to plagiarism detection," in Proc. SEPLN, Donostia, Spain, 2009, pp. 10-18.
[7] R. Yerra and Y.-K. Ng, "A sentence-based copy detection approach for web documents," in Fuzzy Systems and Knowledge Discovery, vol. 3613, L. Wang and Y. Jin, Eds. Springer Berlin / Heidelberg, 2005, pp. 481-482.
[8] J. Koberstein and Y.-K. Ng, "Using word clusters to detect similar web documents," in Knowledge Science, Engineering and Management, vol. 4092, J. Lang, F. Lin, and J. Wang, Eds. Springer Berlin / Heidelberg, 2006, pp. 215-228.
[9] A. Schenker, M. Last, H. Bunke, and A. Kandel, "Classification of web documents using a graph model," in Proceedings of the Seventh International Conference on Document Analysis and Recognition, 2003, pp. 240-244.
[10] T. W. S. Chow, H. Zhang, and M. K. M. Rahman, "A new document representation using term frequency and vectorized graph connectionists with application to document retrieval," Expert Systems with Applications, vol. 36, pp. 12023-12035, March 2009.
[11] M. K. M. Rahman and T. W. S. Chow, "Content-based hierarchical document organization using multi-layer hybrid network and tree-structured features," Expert Systems with Applications, vol. 37, pp. 2874-2881, Sept. 2010.
[12] H. Zhang and T. W. S. Chow, "A coarse-to-fine framework to efficiently thwart plagiarism," Pattern Recognition, vol. 44, pp. 471-487, 2011.
[13] H. Zhang and T. W. S. Chow, "A multi-level matching method with hybrid similarity for document retrieval," Expert Systems with Applications, vol. 39, pp. 2710-2719, Feb. 2012.
[14] S. Wold, K. Esbensen, and P. Geladi, "Principal component analysis," Chemometrics and Intelligent Laboratory Systems, vol. 2, pp. 37-52, 1987.
[15] T. Kohonen, "The self-organizing map," Proceedings of the IEEE, vol. 78, pp. 1464-1480, 1990.
[16] A. Si, H. V. Leong, and R. W. H. Lau, "CHECK: a document plagiarism detection system," in Proceedings of the 1997 ACM Symposium on Applied Computing, San Jose, California, United States, 1997, pp. 70-77.
[17] L. Sindhu, B. B. Thomas, and S. M. Idicula, "A study of plagiarism detection tools and technologies," International Journal of Advanced Research in Technology, vol. 1, pp. 64-70, 2011.
[18] G. Salton, A. Wong, and C. S. Yang, "A vector space model for automatic indexing," Commun. ACM, vol. 18, pp. 613-620, 1975.
[19] J. Zobel and A. Moffat, "Exploring the similarity space," ACM SIGIR Forum, vol. 32, pp. 18-34, 1998.
[20] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by latent semantic analysis," Journal of the American Society for Information Science, vol. 41, pp. 391-407, 1990.
[21] K. Lagus, "Text retrieval using self-organized document maps," Neural Processing Letters, vol. 15, pp. 21-29, 2002.
[22] S. Kaski, T. Honkela, K. Lagus, and T. Kohonen, "WEBSOM – self-organizing maps of document collections," Neurocomputing, vol. 21, pp. 101-117, 1998.
[23] N. Ampazis and S. Perantonis, "LSISOM – a latent semantic indexing approach to self-organizing maps of document collections," Neural Processing Letters, vol. 19, pp. 157-173, April 2004.
[24] A. Georgakis, C. Kotropoulos, A. Xafopoulos, and I. Pitas, "Marginal median SOM for document organization and retrieval," Neural Networks, vol. 17, pp. 365-377, 2004.
[25] X. Xue and Z. Zhou, "Distributional features for text categorization," IEEE Transactions on Knowledge and Data Engineering, vol. 21, pp. 428-442, March 2009.
[26] X. Hu, X. Zhang, C. Lu, E. K. Park, and X. Zhou, "Exploiting Wikipedia as external knowledge for document clustering," in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, 2009, pp. 389-396.
[27] A. Hotho, S. Staab, and G. Stumme, "Ontologies improve text document clustering," in Proceedings of the Third IEEE International Conference on Data Mining (ICDM 2003), 2003, pp. 541-544.
[28] S. Liu, F. Liu, C. Yu, and W. Meng, "An effective approach to document retrieval via utilizing WordNet and recognizing phrases," in Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Sheffield, United Kingdom, 2004, pp. 266-272.
[29] J. Sedding and D. Kazakov, "WordNet-based text document clustering," in Proceedings of the 3rd Workshop on RObust Methods in Analysis of Natural Language Data, Geneva, 2004, pp. 104-113.
[30] G. Varelas, E. Voutsakis, P. Raftopoulou, E. G. M. Petrakis, and E. E. Milios, "Semantic similarity methods in WordNet and their application to information retrieval on the web," in Proceedings of the 7th Annual ACM International Workshop on Web Information and Data Management, Bremen, Germany, 2005, pp. 10-16.
[31] G. Spanakis, G. Siolas, and A. Stafylopatis, "Exploiting Wikipedia knowledge for conceptual hierarchical clustering of documents," The Computer Journal, vol. 55, pp. 299-312, March 2012.
[32] S. Meyer zu Eissen, B. Stein, and M. Kulig, "Plagiarism detection without reference collections," in Advances in Data Analysis, R. Decker and H.-J. Lenz, Eds. Berlin, Heidelberg: Springer, 2007, pp. 359-366.
[33] S. Meyer zu Eissen and B. Stein, "Intrinsic plagiarism detection," in Advances in Information Retrieval, vol. 3936, M. Lalmas, A. MacFarlane, S. Rüger, A. Tombros, T. Tsikrika, and A. Yavlinsky, Eds. Springer Berlin / Heidelberg, 2006, pp. 565-569.
[34] N. Shivakumar and H. Garcia-Molina, "Building a scalable and accurate copy detection mechanism," in Proceedings of the First ACM International Conference on Digital Libraries, Bethesda, Maryland, United States, 1996, pp. 160-168.
[35] A. Barrón-Cedeño and P. Rosso, "On automatic plagiarism detection based on n-grams comparison," in Proc. 31st Eur. Conf. IR Res. Adv. Inf. Retrieval, 2009, pp. 696-700.
[36] E. Stamatatos, "Plagiarism detection using stopword n-grams," Journal of the American Society for Information Science and Technology, vol. 62, pp. 2512-2527, Sept. 2011.
[37] G. A. Miller, "WordNet: a lexical database for English," Commun. ACM, vol. 38, pp. 39-41, 1995.