November 2012 - University of South Australia

Masters Computing Minor Thesis
Concept based Tree Structure Representation for
Paraphrased Plagiarism Detection
By
Kiet Nim
nimhy003@mymail.unisa.edu.au
A thesis submitted for the degree of
Master of Science (Computer and Information Science)
School of Computer and Information Science
University of South Australia
November 2012
Supervisor
Dr. Jixue Liu
Associate Supervisor
Dr. Jiuyong Li
Declaration
I declare that this thesis presents original work conducted by myself and does not incorporate, without reference, any material previously submitted for a degree in any university. To the best of my knowledge, the thesis does not contain any material previously published or written except where due acknowledgement is made in the content.
Kiet Nim
November 2012
Acknowledgements
I would like to express my sincere gratitude to my supervisors, Dr. Jixue Liu and Dr. Jiuyong Li – professors and researchers at the University of South Australia – for their dedicated support, professional advice, feedback and encouragement throughout the period of conducting this study. In addition, I would like to thank all of my course coordinators for their dedicated and in-depth teaching. Finally, I would like to thank my family for their constant encouragement and full support throughout my studies in Australia.
Abstract
In the era of the World Wide Web, searching for information can be performed easily with the support of numerous search engines and online databases. However, this also makes the task of protecting intellectual property from information abuse more difficult. Plagiarism is one such dishonest behavior. Most existing systems can efficiently detect literal plagiarism, where an exact copy is made or only minor changes are applied. In cases where plagiarists use intelligent methods to hide their intentions, these plagiarism detection (PD) systems usually fail to detect plagiarized documents.
The concept based tree structure representation is a potential solution for detecting paraphrased plagiarism – one of the intelligent plagiarism tactics. By exploiting WordNet as background knowledge, a concept-based feature can be generated. This additional feature, combined with the traditional term-based feature and the term-based tree structure, enhances document representation. In particular, the modified model not only captures syntactic information as the term-based model does but also discovers hidden semantic information in a document. Consequently, semantically similar documents can be detected and retrieved.
The contributions of the modified structure are twofold. Firstly, a real-time prototype for high level plagiarism detection is proposed in this study. Secondly, the additional concept-based feature provides considerable improvements for the task of Document Clustering in that more semantically related documents can be grouped into the same clusters even though they are expressed in different ways. Consequently, the task of Document Retrieval can retrieve more relevant documents on the same topics.
Table of Contents
Declaration
Acknowledgements
Abstract
List of Figures
List of Tables
Chapter 1 – Introduction
1.1. Background
1.2. Motivations
1.3. Fields of Thesis
1.4. Research Question
1.5. Contributions
Chapter 2 – Literature Review
2.1. Plagiarism Taxonomy
2.2. Document Representation
2.2.1. Flat Feature Representation
2.2.2. Structural Representation
2.3. Plagiarism Detection Techniques
2.4. Limitations
Chapter 3 – Methodology
3.1. Document Representation and Indexing
3.1.1. Term based Vocabulary Construction
3.1.2. Concept based Vocabulary Construction
3.1.3. Document Representation
3.1.4. Document Indexing
3.2. Source Detection and Retrieval
3.3. Detailed Plagiarism Analysis
3.3.1. Paragraph Level Plagiarism Analysis
3.3.2. Sentence Level Plagiarism Analysis
Chapter 4 – Experiments
4.1. Experiment Initialization
4.1.1. The Dataset and Workstation Configuration
4.1.2. Performance Measures and Parameter Configuration
4.2. Source Detection and Retrieval for Literal Plagiarism
4.3. Source Detection and Retrieval for Paraphrased Plagiarism
4.4. Study of Parameters
4.4.1. Size of Term based Vocabulary V_T
4.4.2. Size of Concept based Vocabulary V_C
4.4.3. Dimensions of Term based PCA feature
4.4.4. Dimensions of Concept based PCA feature
4.4.5. Contribution of the Weights w_1 and w_2
Chapter 5 – Conclusion and Future Works
References
List of Figures
Figure 1 - Taxonomy of Plagiarism
Figure 2 - Term-Document Matrix
Figure 3 - Singular Value Decomposition of term-document matrix A
Figure 4 - 3-layer document-paragraphs-sentences tree representation
Figure 5 - Comparison of the original & modified Porter Stemmers
Figure 6 - Data structure of Term-based Vocabulary
Figure 7 - Example of looking for synonyms, hypernyms and hyponyms
Figure 8 - Data structure for the concept-based Vocabulary
Figure 9 - Concept based Document Tree Representation
Figure 10 - 2-level SOMs for document-paragraph-sentence document tree
Figure 11 - Performance of Source Detection & Retrieval for Literal Plagiarism
Figure 12 - Performance of Source Detection & Retrieval for Paraphrased Plagiarism
Figure 13 - Performance based on different sizes of Term based Vocabulary
Figure 14 - Performance based on different sizes of Concept based Vocabulary
Figure 15 - Performance based on different dimensions of Term based PCA feature
Figure 16 - Performance based on different dimensions of Concept based PCA feature
List of Tables
Table 1 - Configuration of Parameters for Literal Plagiarism
Table 2 - Source Detection & Retrieval for Literal Plagiarism
Table 3 - Configuration of Parameters for Paraphrased Plagiarism
Table 4 - Source Detection & Retrieval for Paraphrased Plagiarism
Table 5 - Performance based on different sizes of Term-based Vocabulary
Table 6 - Performance based on different sizes of Concept based Vocabulary
Table 7 - Performance based on different dimensions of Term based PCA feature
Table 8 - Performance based on different dimensions of Concept based PCA feature
Table 9 - Performance based on different values of w_1 and w_2
Chapter 1 – Introduction
1.1. Background
In the era of the World Wide Web, more and more documents are being digitized and made available for remote access. Searching for information has become even easier with the support of a variety of search engines and online databases. However, this also makes the task of protecting intellectual property from information abuse more difficult. One of those dishonest behaviors is plagiarism. It is clear that plagiarism has caused significant damage to intellectual property. Most cases are detected in academia, in contexts such as student assignments and research works. Lukashenko et al. [1] define plagiarism as the activity of “turning of someone else’s work as your own without reference to original source”.
Several systems and algorithms have been developed to tackle this problem. However, most of them can only detect word-by-word plagiarism, also referred to as literal plagiarism. These are cases where plagiarists make an exact copy or only minor changes to original sources. In cases where significant changes are made, most of these “flat feature” based methods fail to detect plagiarized documents [2]. This type is referred to as intellectual or intelligent plagiarism and includes text manipulation, translation and idea adoption.
In our research, we focus on improving an existing structural model and conducting several experiments to test the detection of one tactic of text manipulation – Paraphrasing.
1.2. Motivations
Paraphrasing is an intellectual plagiarism strategy used to bypass systems that only detect exact copies or plagiarized documents with minor modifications. For instance, one popular and widely used plagiarism detection system in academia is Turnitin. Turnitin can detect word-by-word copying efficiently down to the sentence level. However, by simply paraphrasing detected terms using their synonyms/hyponyms/hypernyms or similar phrases, we can bypass it easily. Paraphrasing is just one of many existing intelligent tactics, and it is clear that plagiarism has become more and more sophisticated [3]. Therefore, it is urgent to have more powerful mechanisms to protect intellectual property from high level plagiarism.
In different plagiarism detection (PD) systems, documents are represented by different non-structural or structural schemes. Non-structural or flat feature based representations are the earliest mechanisms for document representation. Systems such as COPS [4] and SCAM [5] are typical applications of these schemes, where documents are broken into small chunks of words or sentences. These chunks are then hashed and registered against a hash table to perform document retrieval (DR) and PD. The systems in [6-8] use character or word n-grams as the units for similarity detection. All of these flat feature based systems ignore the contextual information of words/terms in a document. Therefore, structural representations have recently been developed to overcome this limitation. Among these schemes, two promising candidates which can capture rich textual information are graph [9, 10] and tree structure representations [2, 11-13]. Applications of structural representation have shown significant improvements in the tasks of DR, document clustering (DC) and PD. However, the majority of both non-structural and structural representation schemes are based on word histograms or features derived from word histograms. They can be used to effectively detect literal plagiarism but are not strong enough to perform intelligent plagiarism detection.
In this research, we focus on analyzing the tree structure representation and the studies of Chow et al. [2, 11-13]. In their works, a document is hierarchically organized into layers. In this way, the tree can capture not only syntactic but also semantic information of a document. While the root represents global information or the main topics of the document, other layers capture local information or sub-topics of the main topics, and the leaf level can be used to perform detailed comparison. Their proposed models have significantly improved the accuracy of DC, DR and PD. However, the features used to represent each layer are still derived from the term-based Vocabulary and, hence, the systems show some limitations when performing intelligent plagiarism detection.
Therefore, in our study, we modify the term-based tree structure representation, in particular the features used to represent each layer, in order to detect a specific type of high level plagiarism – Paraphrasing. The modified structure representation is referred to as the Concept based Tree Structure Representation.
1.3. Fields of Thesis
Document Representation; Information Retrieval; Plagiarism Detection; Text Mining.
1.4. Research Question
This thesis presents the study and development of a new mechanism to detect one particular tactic of sophisticated plagiarism – Paraphrasing. We enhance the original tree structure representation, based solely on word histograms, with an additional feature that captures multi-dimensional semantic information. We call this additional feature the concept-based feature. The ultimate aim of our research is to develop a strong structural representation that can discover plagiarism by paraphrasing and, potentially, higher level plagiarism.
In our experiments, we focus on examining how the concept-based feature, in combination with the term-based tree structure representation, contributes to the tasks of document organization, document retrieval and paraphrased plagiarism detection. In the modified structural representation, each node of the tree is represented by two derived vectors of terms and concepts. To overcome the “Curse of Dimensionality” due to the lengths of these vectors, we further apply Principal Component Analysis (PCA) [14] – a well-known technique for dimensionality reduction. For the number of tree layers, we choose the 3-layer document-paragraph-sentence model to represent a document. We also consider the task of document organization by applying the Self-Organizing Map (SOM) clustering technique [15]. In document retrieval, we only consider comparing documents in the same areas, since comparing documents on different topics is regarded as serving no purpose [16], e.g. comparing a CIS paper against a collection of CIS papers rather than against biology papers.
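The PCA reduction step mentioned above can be sketched as follows. This is a minimal illustration via SVD of the centred data matrix; the matrix sizes and the target dimension are illustrative assumptions, not the settings used in the experiments.

```python
# Sketch of PCA dimensionality reduction: project lengthy document
# vectors onto the top-k principal components. Data is random here,
# standing in for term/concept histogram vectors.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))   # 100 documents, 50-dimensional vectors

def pca_reduce(X, k):
    """Project the rows of X onto the k leading principal components."""
    Xc = X - X.mean(axis=0)                       # centre each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                          # scores in top-k subspace

Z = pca_reduce(X, 10)
print(Z.shape)  # (100, 10)
```

The first component carries the largest share of variance, the second the next largest, and so on, which is why truncating to k components keeps most of the information in the vectors.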
To generate the concept-based feature, we use the external background knowledge source WordNet to first construct the concept-based Vocabulary. The concept-based feature is then derived from this Vocabulary and used, together with the term-based feature, to represent a document.
1.5. Contributions
In this thesis, we introduce the C-PCA-SOM 3-stage prototype for high level Plagiarism Detection: Stage 1 – Document Representation & Indexing; Stage 2 – Source Detection & Retrieval; and Stage 3 – Detailed Plagiarism Analysis. In addition, owing to the constant processing time achieved in our experiments, the prototype can support real-time applications for Document Representation, Document Clustering, Document Retrieval and, potentially, Paraphrased Plagiarism Detection.
Through experiments, it is verified that the additional Concept-based feature can improve the performance of Source Detection and Retrieval compared with models based solely on the Term-based feature. Furthermore, it is shown that the enhanced tree structure representation not only captures syntactic information as the original scheme does but also discovers hidden semantic information in a document. By capturing multi-dimensional information of a document, the task of Document Clustering is improved in that more semantically related documents can be grouped into meaningful clusters even though they are expressed differently. As a result, Document Retrieval also benefits, since more documents on the same topics can be detected and retrieved.
Chapter 2 – Literature Review
This chapter provides a comprehensive overview of the literature. Different types of plagiarism are described in Section 2.1. In Section 2.2, a variety of document representation schemes are discussed, including non-structural (flat feature based) representation and structural representation. Existing plagiarism detection techniques are outlined in Section 2.3. Finally, the limitations of these PD techniques and representation schemes are discussed in Section 2.4, identifying potential improvements for further study.
2.1. Plagiarism Taxonomy
When humans began producing papers as part of intellectual documentation, plagiarism also came into existence. Documentation and plagiarism exist in parallel, but they are two opposite sides of the same coin. While one contributes to the knowledge of human society, the other causes serious damage to intellectual property. Recognizing this, the ethical community has developed many techniques to fight plagiarism. However, the battle against this phenomenon is a lifelong one, since plagiarism has also evolved and become more sophisticated. Therefore, to engage such a devious adversary efficiently, it is necessary to have a mapping scheme to identify and classify different types of plagiarism into meaningful categories. Many studies have been conducted to perform this task [1, 3, 17]. Lukashenko et al. [1] point out different types of plagiarism activities, including:
- Copy-paste plagiarism (word-for-word copying).
- Paraphrasing (using synonyms/phrases to express the same content).
- Translated plagiarism (copying content expressed in other languages).
- Artistic plagiarism (expressing plagiarized works in different formats such as images or text).
- Idea plagiarism (extracting and using others’ ideas).
- Code plagiarism (copying others’ program code).
- Improper use of quotation marks.
- Misinformation in references.
More precisely, Alzahrani et al. [3] use a taxonomy that classifies plagiarism into two main categories: literal and intelligent plagiarism (Fig. 1). In the former, plagiarists make an exact copy or only a few changes to original sources; this type of plagiarism can be detected easily. The latter case is much more difficult to detect, as plagiarists try to hide their intentions by using many intelligent ways to change original sources. These tactics include text manipulation, translation and idea adoption. In text manipulation, plagiarists try to change the appearance of the text while keeping its semantic meaning or idea. Paraphrasing is a commonly performed tactic of text manipulation. It transforms text appearance by using synonyms, hyponyms, hypernyms or equivalent phrases. In this research, our main focus is to detect this type of intelligent plagiarism. Plagiarism based on translation is also known as cross-lingual plagiarism. Offenders can use translation software to copy text written in other languages in order to bypass monolingual systems. Finally, Alzahrani et al. consider idea adoption to be the most serious and dangerous type of plagiarism, since stealing ideas from others’ works without proper referencing is the most disrespectful action toward their authors and toward intellectual property. This type of plagiarism is also the hardest to detect, because the plagiarized text might not carry any syntactic information similar to original sources, and because the plagiarized ideas can be extracted from multiple parts of original documents.
2.2. Document Representation
Since a vast number of documents are available online and many more are uploaded every day, the demand for efficiently organizing and indexing them for fast retrieval continually poses challenges for the research community. Many schemes have been developed and improved to represent documents effectively. Instead of using a whole document as a query, these representations can be applied to perform many text processing tasks such as classification, clustering, document retrieval and plagiarism detection. This section discusses the two main strategies of document representation as well as available techniques to detect plagiarism.
2.2.1. Flat Feature Representation
One of the most popular and widely used models is the Vector Space Model (VSM) [18]. In this model, a weighted vector of term frequency and document frequency is calculated based on a pre-constructed Vocabulary. This Vocabulary is the list of the most frequent words/terms, derived beforehand from a given training corpus. The scheme used to perform term weighting is TF-IDF. Term frequency (TF) counts the number of occurrences of each term in one specific document, while document frequency (the basis of the inverse document frequency, IDF) counts the number of documents that contain a specific term. In the VSM model, a vector of word histograms is constructed for each document. All vectors together form the term-document matrix (Fig. 2). The similarity between two documents is calculated by applying the Cosine distance function to their vectors [19]. One drawback of the VSM model is that the vectors used to represent documents are usually lengthy due to the size of the Vocabulary, and the model is hence not scalable for large datasets.
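The TF-IDF weighting and cosine comparison described above can be sketched as follows. This is a minimal illustration: the toy documents, the tokenization and the exact IDF formula are illustrative assumptions rather than the configuration of any system cited here.

```python
# Minimal sketch of the Vector Space Model: TF-IDF weighted vectors
# over a shared vocabulary, compared with cosine similarity.
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build one TF-IDF weighted vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    # document frequency: number of documents containing each term
    df = {t: sum(1 for toks in tokenized if t in toks) for t in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, vectors

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "stock markets fell sharply today"]
vocab, vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # overlapping documents score higher
print(cosine(vecs[0], vecs[2]))  # disjoint documents score 0
```

Note the drawback mentioned above: each vector has one entry per vocabulary term, so the vectors grow with the Vocabulary size.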
Figure 1 - Taxonomy of Plagiarism
Figure 2 - Term-Document Matrix
To overcome the Curse of Dimensionality in VSM, Latent Semantic Indexing (LSI) [20] was proposed to project lengthy vectors onto a lower number of dimensions, with semantic information preserved, by mapping the space spanned by these vectors to a lower dimensional subspace. The mapping is based on the Singular Value Decomposition (SVD) of the VSM-based term-document matrix (Fig. 3). Another approach to high-dimension reduction and feature compression is to apply the Self-Organizing Map (SOM) [15]. In a SOM, similar documents are organized close to each other. Instead of being represented by word histogram vectors, each document is represented by its winner neuron or Best Matching Unit on the map. Applications of SOM such as VSM-SOM [21], WEBSOM [22], LSISOM [23] and those in [21, 24] have shown considerable speed-ups in document clustering and retrieval. SOM can be utilized in combination with not only flat feature representation but also the structural representation discussed in the next section.
Figure 3 - Singular Value Decomposition of term-document matrix A
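The LSI projection can be sketched with a truncated SVD. The tiny term-document matrix and the choice of k below are illustrative assumptions; real term-document matrices are large and sparse.

```python
# Sketch of Latent Semantic Indexing: keep the top-k singular triplets
# of the term-document matrix A and represent each document by its
# coordinates in the resulting k-dimensional latent space.
import numpy as np

# rows = terms, columns = documents (toy counts; two topic blocks)
A = np.array([[2.0, 1.0, 0.0, 0.0],
              [1.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                # illustrative latent dimension
docs_k = np.diag(s[:k]) @ Vt[:k]     # each column: a document in k dims

print(docs_k.shape)  # (2, 4): 4 documents reduced to 2 dimensions
```

In the latent space, documents from the same topic block stay close together while documents from different blocks remain far apart, which is the property LSI exploits for retrieval.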
Since relying only on a “bag of words” might not be enough, many further studies propose adding extra features alongside term-based flat features to enhance document representation. In [25], Xue et al. propose using distributional features in combination with the traditional term frequency to improve text categorization. The proposed features include the compactness of the appearances of a word and the position of the first appearance of a word. Based on these features, they assign a specific weight to a word according to its compactness and position; for example, authors are likely to mention the main content in the earlier parts of a document, so words appearing in these parts are considered more important and assigned higher weights. Similarly, another approach to “enrich” document representation is to utilize external background knowledge such as WordNet, Wikipedia or thesaurus dictionaries. In [26], Hu et al. use Wikipedia to derive two additional features, concept-based and category-based, on top of the conventional term-based feature. Their experiments have shown significant improvements in document clustering. Similar applications of external background knowledge can be found in [26-31]. In our study, we apply WordNet instead of Wikipedia to generate the concept-based feature and use it to enhance document representation.
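The idea of a concept-based feature can be sketched as follows. A small hand-made synonym table stands in for WordNet here; the terms, concept identifiers and the one-concept-per-term mapping are purely illustrative assumptions, not the vocabulary construction used in this thesis.

```python
# Sketch of concept-based feature generation: map each term to a
# concept identifier and count concepts instead of raw terms, so that
# paraphrased wordings collapse onto the same histogram.
from collections import Counter

# hypothetical concept table standing in for WordNet lookups
CONCEPTS = {
    "car": "vehicle", "automobile": "vehicle", "truck": "vehicle",
    "buy": "acquire", "purchase": "acquire",
    "house": "dwelling", "home": "dwelling",
}

def concept_histogram(tokens):
    """Count concept occurrences for all terms known to the table."""
    return Counter(CONCEPTS[t] for t in tokens if t in CONCEPTS)

a = concept_histogram("i will buy a car".split())
b = concept_histogram("i will purchase an automobile".split())
print(a == b)  # True: the paraphrase shares the same concept histogram
```

A term-based histogram would score these two sentences as almost disjoint, which is exactly the weakness the concept-based feature is meant to address.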
2.2.2. Structural Representation
By using only word histogram vectors to represent documents, flat feature representation ignores the contextual usage and relationships of terms throughout a document [2] and hence loses semantic information. In addition, two documents might be contextually different even though they contain the same term distribution. Recognizing this serious limitation, many further studies have tried to develop new ways to represent a document that capture not only syntactic but also semantic information. These new schemes are referred to as structural representations.
To capture semantic information, Schenker et al. [9] propose using a directed graph model to represent documents. The graph structure consists of two components: nodes and edges. Nodes (vertices) are the terms appearing in a document, weighted by their number of appearances. Edges link nodes together and indicate the relationship between terms: an edge is only formed between two terms that appear immediately next to each other in a sentence. Chow et al. [10] also study the directed graph and further develop another type of graph – the undirected graph. Their directed model similarly considers the order of term occurrence in a sentence, while the undirected model considers the connection of terms without taking their usage order into consideration. They further perform Principal Component Analysis (PCA) for dimensionality reduction and SOM for document organization. Their experiments show significant improvements compared with other single feature based approaches.
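The directed graph construction described above can be sketched as follows. The example sentences and the dictionary-based graph layout are illustrative assumptions; the cited works define their own weighting details.

```python
# Sketch of a directed term graph: nodes are terms weighted by their
# frequency, and a directed edge links two terms that appear
# immediately next to each other in a sentence.
from collections import Counter, defaultdict

def build_graph(sentences):
    nodes = Counter()           # term -> number of occurrences
    edges = defaultdict(int)    # (term, next_term) -> count
    for sent in sentences:
        tokens = sent.lower().split()
        nodes.update(tokens)
        for a, b in zip(tokens, tokens[1:]):
            edges[(a, b)] += 1  # direction preserves word order
    return nodes, dict(edges)

nodes, edges = build_graph(["the cat chased the mouse",
                            "the mouse escaped"])
print(nodes["the"])             # 3
print(edges[("the", "mouse")])  # 2
```

Dropping the edge direction (e.g. by sorting each term pair before counting) would give the undirected variant studied by Chow et al.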
Another group of models which can capture both syntactic and semantic information of a document is the group of tree-based representation models. The earliest study of the tree structure representation was conducted by Si et al. [16]. They observe that it is unnecessary to compare documents addressing different subjects. Their model organizes a document according to its structure and hence forms a tree; i.e., a document may contain many sections, a section may contain many subsections, a subsection again might have many sub-subsections, etc. This mechanism significantly improves the effectiveness of document comparison, since lower level comparisons can be terminated if higher levels are different. However, the lengthy vectors at each level and the potentially high number of layers make their model not scalable for big corpora. More recent works of Chow et al. [2, 11-13] use a fixed number of layers (2 or 3), reduced-size term-based Vocabularies and PCA compression to make the tree structure representation applicable for large datasets. To minimize the time complexity of document retrieval, they further apply SOM to organize documents according to their similarities [2, 11]. Fig. 4 shows the 3-layer document-paragraph-sentence tree representation in [12], which is also the model we focus on improving in this study. Other choices of layers and representation units for layers can be found in [2, 11, 13].
Figure 4 - 3 layer document-paragraphs-sentences tree representation
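The 3-layer document-paragraph-sentence tree can be sketched as a nested structure in which every node carries a histogram summarising its subtree. The splitting rules below (blank-line paragraphs, '.'-delimited sentences) and the plain term histograms are simplifying assumptions; the actual model uses the vocabulary-based and PCA-compressed features described later.

```python
# Sketch of the 3-layer document-paragraph-sentence tree: the root
# summarises the whole document, mid-level nodes summarise paragraphs,
# and leaves hold per-sentence term histograms.
from collections import Counter

def build_tree(text):
    doc = {"hist": Counter(), "paragraphs": []}
    for para in filter(None, (p.strip() for p in text.split("\n\n"))):
        p_node = {"hist": Counter(), "sentences": []}
        for sent in filter(None, (s.strip() for s in para.split("."))):
            s_hist = Counter(sent.lower().split())
            p_node["sentences"].append({"hist": s_hist})
            p_node["hist"] += s_hist      # paragraph summarises sentences
        doc["hist"] += p_node["hist"]     # root summarises paragraphs
        doc["paragraphs"].append(p_node)
    return doc

tree = build_tree("Cats sleep. Cats purr.\n\nDogs bark.")
print(len(tree["paragraphs"]))  # 2
print(tree["hist"]["cats"])     # 2
```

Comparing two such trees top-down lets a system stop early whenever the root or paragraph histograms are already too dissimilar, which is the efficiency argument made above.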
2.3. Plagiarism Detection Techniques
According to Lukashenko et al. [1], the task of fighting plagiarism is classified into plagiarism prevention and plagiarism detection. The main difference between the two classes is that detection requires less time to implement but can only achieve a short-term positive effect. On the other hand, although prevention methods are time consuming to develop and deploy, they have a long-term effect and hence are considered a more significant approach to effectively fighting plagiarism. Prevention, unfortunately, is a global issue and cannot be solved by just one institution. Therefore, most existing techniques fall into the detection category, and much research has been conducted to develop more powerful plagiarism detection techniques.
Alzahrani et al. [3] categorize plagiarism detection techniques into two broad trends: intrinsic and extrinsic plagiarism detection. Intrinsic PD techniques analyze a suspicious document locally, i.e. without collecting a set of candidate documents for comparison [32, 33]. These approaches employ novel analyses based on authors’ writing styles. They rest on the hypothesis that each writer has a unique writing style, so a change in writing style signals a potential plagiarism case. Features used for this type of PD are stylometric features based on text statistics, syntactic features, POS features, closed-class word sets and structural features. On the other hand, extrinsic PD techniques compare query documents against a set of source documents. Most existing PD systems deploy these extrinsic techniques. There are several common steps to perform extrinsic PD. Firstly, Document Indexing is applied to store all registered documents in databases for later retrieval. Secondly, Document Retrieval is performed to retrieve the most relevant candidates that might have been plagiarized by given query documents. Eventually, Exhaustive Analysis is carried out between candidates and query documents to locate plagiarized parts.
For extrinsic PD techniques, the majority of exhaustive analysis methods partition all
documents into blocks (n-grams or chunks) [4, 34-36]. Units in each block can be
characters, words, sentences, paragraphs, etc. These blocks are then hashed and registered
in a hash table. To perform PD, suspicious documents are also divided into small
blocks and looked up in the hash table. Eventually, similar blocks are retrieved for
detailed comparison. COPS [4] and SCAM [5] are two typical implementations of these
approaches. According to [2, 16], these methods are inapplicable to large corpora because
the number of documents keeps increasing over time. Furthermore, they can be
bypassed easily by making some changes at the sentence level.
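The block-hashing workflow described above can be sketched as follows. This is an illustrative toy, not the actual COPS or SCAM implementation; the word 3-gram granularity and all function names are assumptions.

```python
from collections import defaultdict

def ngrams(words, n=3):
    """Split a token list into overlapping word n-grams (n = 3 is an assumption)."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def build_index(corpus):
    """Register every hashed block of every source document in a hash table."""
    index = defaultdict(set)
    for doc_id, words in corpus.items():
        for gram in ngrams(words):
            index[hash(gram)].add(doc_id)
    return index

def candidate_sources(index, suspicious_words):
    """Look up the suspicious document's blocks; count matches per source."""
    hits = defaultdict(int)
    for gram in ngrams(suspicious_words):
        for doc_id in index.get(hash(gram), ()):
            hits[doc_id] += 1
    return dict(hits)
```

As the text notes, paraphrasing a few words per sentence defeats such exact block matching, since the changed n-grams no longer hash to registered entries.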
It is noticeable that the methods mentioned above apply flat features only and ignore the
contextual information of how words/terms are used throughout a document. Two
documents with the same term distribution might differ contextually. To tackle this
problem, PD systems that utilize structural representation have been proposed [2, 12,
16]. These approaches significantly improve the performance of extrinsic plagiarism
detection. Since documents are hierarchically organized into multiple levels, the comparison
can be terminated at any level where the amount of dissimilarity between query and
candidate documents exceeds a user-defined threshold. Experiments with these structure-based
models have shown better performance compared with flat-feature-based systems.
2.4. Limitations
Most existing PD systems are implemented based on flat feature representation. As
mentioned in 2.1, they cannot capture the contextual usage of words/terms throughout a
document and can be bypassed easily with minor modifications performed on original
sources. Structural-representation-based PD systems have made significant improvements
in capturing this rich textual information. By organizing documents hierarchically,
structural models can capture not only syntactic but also semantic information of a
document. Recent studies have shown some important contributions of structural
representation to document organization tasks such as classification and clustering [2, 11, 13].
Consequently, the task of plagiarism detection is also improved, in terms of both reduced
time complexity and higher detection accuracy: the most relevant documents are
retrieved first to narrow the processing scope, and further comparisons are terminated at
levels where the documents are clearly dissimilar.
Although it has been proved that structural representation can be applied to detect literal
plagiarism efficiently, structural-representation-based PD systems still show some
limitations in detecting intelligent plagiarism. For example, plagiarists can paraphrase,
replacing the detected words/terms with their synonyms, hyponyms or hypernyms, to
bypass the detection of these systems. The problem arises from the term-based
Vocabulary: in this type of Vocabulary, terms with similar meanings are treated as
unrelated. For instance, large, huge and enormous carry similar meanings and are
interchangeable in usage; however, in this type of Vocabulary they are considered
different terms. Therefore, any feature derived from this Vocabulary is not strong enough
to detect sophisticated plagiarism. By replacing words/terms of an original sentence with
semantically similar words/terms, a plagiarist makes the plagiarized sentence appear to
be a different sentence.
In order to discover similar sentences even when they are expressed in different ways,
our study exploits the external background knowledge source WordNet to construct the
concept-based Vocabulary. This additional Vocabulary is built by grouping words with
similar meanings in the term-based Vocabulary into one concept. After that, we use this
Vocabulary to generate one more feature, called the concept-based feature, to enrich the
representation of a document.
Chapter 3 – Methodology
This section outlines the main techniques we apply to develop the prototype for
paraphrasing detection. We call it the 3-stage prototype including: Stage 1 – Document
representation & Indexing, Stage 2 – Source detection & Retrieval and Stage 3 – Detail
Plagiarism Analysis. The content of Stage 1 is discussed in section 3.1 consisting of the
construction of two types of Vocabulary, the extraction of the corresponding 2 types of
feature to represent a document and, finally, the application of SOM to organize
documents into meaningful clusters. Section 3.2 gives the details of Stage 2 on how to
use the stored data in Stage 1 to perform fast original source identification and retrieval.
Finally, Stage 3 of the prototype performs plagiarism detection in details based on
retrieved candidate documents from Stage 2. The mechanism of the detail analysis of
Stage 3 is outlined in section 3.3.
3.1. Document Representation and Indexing
3.1.1. Term based Vocabulary Construction
The construction of the term-based Vocabulary is straightforward. Firstly, we perform
term extraction from a training corpus. After that, we apply Word Stemming to
transform terms into their simple forms. For example, words such as "computes",
"computing" and "computed" are all reduced to "compute". Because the original
Porter stemming algorithm only creates "stems" instead of words in their simple forms,
making it impossible to look them up in an English dictionary or thesaurus, we
have modified the Porter algorithm. The modified version tries to stop at the stage
where words are in or near their simple forms. As a result, it is possible to search for
these words' synonyms, hypernyms and hyponyms via, for example, a thesaurus.
Fig. 5 depicts the difference between the original and modified Porter stemmers.
Figure 5 - Comparison of the original & modified Porter Stemmers
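The core idea of the modification, stopping suffix stripping as soon as a dictionary word is reached, can be illustrated with a toy sketch. The tiny lexicon, suffix list and function names below are purely illustrative and not the actual modified Porter algorithm.

```python
# Toy illustration of "stop at a dictionary word" stemming.
# LEXICON stands in for a real English dictionary (an assumption).
LEXICON = {"compute", "large", "huge"}

SUFFIXES = ["ing", "ed", "es", "s"]  # checked longest-first

def simple_form(word, lexicon=LEXICON):
    """Strip suffixes, but stop as soon as the result is a dictionary word."""
    if word in lexicon:
        return word
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            # try the bare stem and a restored trailing 'e' ("comput" -> "compute")
            for candidate in (stem, stem + "e"):
                if candidate in lexicon:
                    return candidate
    return word  # fall back to the surface form
```

Unlike a raw Porter stem such as "comput", the returned form can be looked up directly in a thesaurus, which is what the concept construction in Section 3.1.2 requires.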
After stemming, we perform stop word removal to discard insignificant words such as
"a", "the", "are", etc. Finally, we use the TF-IDF (Term Frequency – Inverse Document
Frequency) weighting scheme to weight the significance of each word throughout the
corpus. The weights of all terms are then ranked from highest to lowest (from most to
least significant). Similar to Chow et al. [2, 12], we select the first n1 terms to form the
Vocabulary V1T used for the Document and Paragraph levels of the tree structure
representation. The first n2 terms are selected to form the Vocabulary V2T, which is used
for the Sentence level; n2 is much larger than n1. The data structure of the two
term-based Vocabularies is shown in Fig. 6.
Figure 6 - Data structure of Term-based Vocabulary
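The term ranking and selection step can be sketched as follows. This is a minimal illustration of TF-IDF-based vocabulary selection; the exact weighting variant and data structures used by the prototype may differ.

```python
import math
from collections import Counter

def tfidf_vocabulary(corpus_tokens, n):
    """Rank terms by their highest TF-IDF weight across the corpus and
    keep the top n as the vocabulary (a sketch of the selection step)."""
    df = Counter()                       # document frequency per term
    for doc in corpus_tokens:
        df.update(set(doc))
    n_docs = len(corpus_tokens)
    best = {}
    for doc in corpus_tokens:
        tf = Counter(doc)
        for term, count in tf.items():
            w = (count / len(doc)) * math.log(n_docs / df[term])
            best[term] = max(best.get(term, 0.0), w)
    ranked = sorted(best, key=best.get, reverse=True)
    return ranked[:n]
```

Calling it twice with cut-offs n1 and n2 (n2 >> n1) would yield the two vocabularies V1T and V2T described above.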
3.1.2. Concept based Vocabulary Construction
In order to construct the additional concept-based Vocabulary, we exploit a source of
background knowledge: WordNet, the lexical database for the English language [37].
WordNet, whose development by Miller began in 1985, has been applied in many
text-processing tasks such as document clustering [29], document retrieval [28, 30] and
word-sense disambiguation. In WordNet, nouns, verbs, adjectives and adverbs are
distinguished and organized into meaningful sets of synonyms. In this section, we outline
the mechanism for utilizing WordNet to construct the concept-based Vocabulary.
Firstly, for each term T in the term-based Vocabulary, we look up its synonyms,
hypernyms and hyponyms in WordNet by using the synonym-hypernym-hyponym
relationships of the ontology. The result of this step is a "bag" of terms similar to T (Fig. 7).
After that, we check these terms' appearances in the term-based Vocabulary. Any
term that does not appear in the term-based Vocabulary is removed, and the remaining
terms generate one concept. This process is performed repeatedly for the whole
term-based Vocabulary to obtain the concept-based Vocabulary. Fig. 8 shows the data
structure of the additional Vocabulary.
Figure 7 - Example of looking for synonyms, hypernyms and hyponyms
Figure 8 - Data structure for the concept-based Vocabulary
Similar to the construction of Vocabularies V1T and V2T, we select the first m1 concepts to
form the Vocabulary V1C used for the Document and Paragraph levels and the first m2
concepts to form the Vocabulary V2C used for the Sentence level (m2 is also much larger
than m1).
3.1.3. Document Representation
After constructing the two types of Vocabulary, we store them to the hard drive and are
then ready to compute each document's representation. In our research, we choose the
Document-Paragraph-Sentence 3-layer tree representation in [12]. Following Zhang et al.,
each document is first partitioned into paragraphs and each paragraph is similarly
partitioned into sentences. This process builds the 3-layer tree representation of each
document. The root node represents the whole document, the second layer captures
information about the paragraphs of the document, and each paragraph has its sentences
situated at the corresponding leaf nodes.
The modification of the original tree structure is carried out in the feature construction
steps for each layer. For all layers, term extraction, stemming and stop word removal are
still applied so that only significant terms are extracted. For the top and second layers,
term-based vectors are derived as usual by checking and weighting the terms that appear
in the term-based Vocabulary V1T. At the same time, we map all of these terms to their
concepts based on the Vocabulary V1C; the weight of a concept is the sum of its member
terms' weights. For the bottom layer, instead of using word histograms, we use an
"appearance indices of terms" vector to indicate the absence/presence of the corresponding
terms in a sentence, similar to [12]. In addition, an "appearance indices of concepts"
vector is utilized to indicate the absence/presence of the corresponding concepts in the
sentence. At this stage, each node of the tree is represented by two features: the
term-based feature and the additional concept-based feature.
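The concept-weighting rule just described (the weight of a concept is the sum of its member terms' weights) can be sketched as below; the data structures are illustrative assumptions.

```python
def concept_weights(term_weights, concept_vocab):
    """Weight of each concept = sum of the weights of its member terms.

    `term_weights` maps term -> TF-IDF weight; `concept_vocab` is a list of
    concepts, each a list of member terms (illustrative structures)."""
    return {i: sum(term_weights.get(t, 0.0) for t in members)
            for i, members in enumerate(concept_vocab)}
```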
To overcome the "Curse of Dimensionality", we use Principal Component Analysis
(PCA) to compress the features at the Document and Paragraph levels. PCA is a
well-known tool for feature compression and dimensionality reduction. We use the
same training corpus as the one used for constructing the Vocabularies to calculate
two PCA rotation matrices independently for the term and concept features. We also store
the matrices to the hard disk so that they can be applied to query documents later, in the
stage of Source Detection and Retrieval. The PCA-projected features are calculated as below:
𝑭𝑷π‘ͺ𝑨
= π‘­π’Œ ∗ 𝑹 π’Œ × π’
𝒍
(1)
Where πΉπ‘˜ = {𝑓1 , 𝑓2 , …, π‘“π‘˜ } is the normalized term- or concept-based histograms with k
dimensions, π‘…π‘˜ × π‘™ is the k x l PCA rotation matrix and 𝐹𝑙𝑃𝐢𝐴 is the resulted PCAcompressed feature with reduced dimensions of l (l is much smaller than k).
Finally, the tree data of all documents is stored in order to later perform the similarity
calculations between suspicious and original documents, paragraphs or sentences for both
Source Detection & Retrieval and Detail Plagiarism Analysis. The concept-based tree
structure representation of a document is illustrated in Fig. 9.
Figure 9 - Concept based Document Tree Representation
3.1.4. Document Indexing
According to Si et al. [16], it is unnecessary to compare documents on different
topics. Therefore, document organization is crucial to avoid redundant comparisons and
to minimize processing time as well as computational complexity. For this reason, we
apply a clustering technique to organize similar documents into the same clusters. The
chosen clustering method is the Self-Organizing Map (SOM), due to its flexibility and
time efficiency. All documents in an experiment dataset have their trees organized on the
map. We construct two SOMs, for the root and paragraph levels, in the same manner as
[2]. Initially, the SOM of the paragraph level is built by mapping all paragraphs'
PCA-compressed term- and concept-based features of all documents onto the map. The
results of the 2nd-level SOM are then used as part of the inputs to the root-level SOM:
the features of the root of each document's tree, combined with the resulting winner
neurons (also known as Best Matching Units, BMUs) of the corresponding child
paragraphs, form the input to the top map. These inputs are then mapped to their nearest
BMUs on the root SOM. The mapping process is repeated a number of times so that all
similar documents converge on the two maps. Eventually, the data of the SOMs is stored
to enable fast source detection and retrieval in Stage 2. Fig. 10 illustrates how a document
tree is organized onto the document and paragraph SOMs.
Figure 10 - 2 level SOMs for document-paragraph-sentence document tree
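A minimal SOM training loop, reduced to BMU search and a decaying-learning-rate update, might look like the following sketch. A real SOM also updates the BMU's neighbourhood; all names here are illustrative, not the prototype's code.

```python
import numpy as np

def train_som(data, rows, cols, iters=100, lr=0.5, seed=0):
    """Minimal SOM sketch: repeatedly pull the best-matching unit toward a
    random sample, with a learning rate that decays over the iterations.
    The neighbourhood update of a full SOM is omitted for brevity."""
    rng = np.random.default_rng(seed)
    weights = rng.random((rows * cols, data.shape[1]))
    for t in range(iters):
        x = data[rng.integers(len(data))]
        bmu = np.argmin(((weights - x) ** 2).sum(axis=1))
        weights[bmu] += lr * (1.0 - t / iters) * (x - weights[bmu])
    return weights

def best_matching_unit(weights, x):
    """Index of the neuron closest to the feature vector x."""
    return int(np.argmin(((weights - x) ** 2).sum(axis=1)))
```

In the two-level scheme above, the paragraph-level map would be trained first, and the BMU indices of a document's paragraphs would then be appended to its root feature before training the document-level map.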
3.2. Source Detection and Retrieval
Stage 2 of the prototype detects and retrieves source documents, or relevant candidates,
given suspicious documents. For each query document, in the same way as when
constructing the tree representations of corpus documents in Stage 1, we first use the
stored term- and concept-based Vocabularies to build the query's tree representation.
Secondly, we load the stored PCA projection matrices and perform feature compression
on the query tree representation. After that, we take the root node and find its Best
Matching Unit on the document-level SOM. Subsequently, the n candidate documents
associated with the BMU are retrieved. If the number of documents associated with the
BMU is less than n, we retrieve the remaining documents from those of the BMU's
nearest neighbors that contain documents most similar to those in the BMU.
For the n candidate documents, we compute the summed Cosine distance of the term- and
concept-based PCA vectors between the query document and these candidates. The
formula for the summed Cosine distance, or overall similarity, is defined as:
D(q, c) = w1 * d(f1q, f1c) + w2 * d(f2q, f2c)   (2)

Where
w1 + w2 = 1
q, c: query and candidate documents
f1q, f1c: term-based PCA-projected features
f2q, f2c: concept-based PCA-projected features
d: Cosine distance function
The overall similarity is the sum of the individual similarities of the different types of
feature. w1 and w2 are the weights used to control the importance of the different features
in the overall similarity. In the experiments, we assign different weights to each feature to
evaluate how much each feature contributes to the performance of source detection
and retrieval.
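Equation (2) translates directly into code; the following sketch assumes equal weights by default and uses the standard cosine distance.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity of two feature vectors."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def summed_distance(f1q, f1c, f2q, f2c, w1=0.5, w2=0.5):
    """Eq. (2): weighted sum of term- and concept-feature cosine distances."""
    return w1 * cosine_distance(f1q, f1c) + w2 * cosine_distance(f2q, f2c)
```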
After calculating the summed Cosine distances, we rank them in ascending order
and choose the t documents whose distances are lower than the user-defined similarity
threshold D_t^doc for further analysis. The threshold D_t^doc is calculated as below:

D_t^doc = D_min^doc + ε(D_max^doc − D_min^doc)   (3)

Where ε ∈ [0, 1]; it is noted that ε = 0 is equivalent to single-source retrieval, i.e. only the
most similar document will be retrieved.
3.3. Detail Plagiarism Analysis
The third stage of the prototype detects local similarity: it identifies the candidate
paragraphs that are most similar to each suspicious paragraph of the query document. By
doing so, we avoid exhaustive sentence comparison for unrelated paragraphs and speed
up the detection process. Sentence comparison is then carried out between each sentence
of the suspicious paragraph and the sentences of the candidate paragraphs to locate
potential plagiarism cases. These cases are summarized and reported to the user for
human assessment.
3.3.1. Paragraph Level Plagiarism Analysis
For each suspicious paragraph of the query document, we similarly find its BMU on the
paragraph level SOM. However, we only retrieve paragraphs that belong to the t detected
candidates in Stage 2. We do not retrieve other paragraphs even though they are also
associated with the BMU because their parent nodes are different from the suspicious
paragraph’s parent node (i.e., documents mention different topics).
After retrieving all candidate paragraphs, we again calculate the summed Cosine
distances between these paragraphs and the corresponding suspicious paragraph. Next,
we rank the distances in ascending order and select the t′ paragraphs whose distances are
lower than the similarity threshold D_t^para, on which we then perform the exhaustive
sentence-level plagiarism analysis. The threshold D_t^para is defined as follows:

D_t^para = D_min^para + ε′(D_max^para − D_min^para)   (4)

Where ε′ ∈ [0, 1], and ε′ = 0 corresponds to single plagiarized-paragraph detection, i.e.
only the most similar paragraph will be retrieved.
3.3.2. Sentence Level Plagiarism Analysis
After retrieving the most relevant paragraphs for each suspicious paragraph, for each
pair of original and suspicious paragraphs we perform an exhaustive comparison of all
of their sentences using the corresponding leaf nodes of the tree representations. Because
we use appearance indices of terms and concepts rather than histograms as features for
the bottom layer, the calculation of sentence similarity is slightly different. We define the
similarity of two sentences as the amount of overlap between their terms and concepts.
The sentence similarity is calculated by the following formula:

I_qc^Overlap = w1 * |S_q,f1^Sen ∩ S_c,f1^Sen| / |S_q,f1^Sen ∪ S_c,f1^Sen| + w2 * |S_q,f2^Sen ∩ S_c,f2^Sen| / |S_q,f2^Sen ∪ S_c,f2^Sen|   (5)

Where
w1 + w2 = 1
S_q,f1^Sen, S_c,f1^Sen: appearance indices of terms of the query and candidate sentences
S_q,f2^Sen, S_c,f2^Sen: appearance indices of concepts of the query and candidate sentences
The overall overlap between a query sentence and a candidate sentence is the sum of the
individual overlaps of the different types of feature. If the summed overlap I_qc^Overlap
is larger than the overlap threshold α ∈ [0.5, 1], then this pair of sentences is considered a
plagiarism case. The user can flexibly change the overlap threshold to detect more or
fewer plagiarism cases. For example, if α = 0.8, any pair of sentences with a degree of
overlap of more than 80% is considered a plagiarism case. This exhaustive process is
repeated for the remaining pairs of paragraphs. Finally, all plagiarism cases are presented
to the user for human review.
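The per-sentence overlap of Equation (5) is a weighted sum of two Jaccard coefficients over appearance-index sets, which can be sketched as follows (function names are illustrative).

```python
def jaccard(a, b):
    """Overlap ratio, intersection over union, of two appearance-index sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def sentence_overlap(terms_q, terms_c, concepts_q, concepts_c, w1=0.5, w2=0.5):
    """Eq. (5): weighted sum of term and concept overlaps for two sentences.
    A pair is flagged as a plagiarism case when this exceeds the threshold alpha."""
    return w1 * jaccard(terms_q, terms_c) + w2 * jaccard(concepts_q, concepts_c)
```

The concept overlap is what lets a paraphrased sentence still score highly: "big dog" and "large dog" share only one term, but their concept sets can coincide entirely.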
Chapter 4 – Experiments
This section outlines the experiments we carry out to test the performance of our
prototype. Firstly, Section 4.1 introduces the dataset used for training and testing
plagiarism detection, as well as the configuration of the experiment workstation. Up to
the point of writing this thesis, we have conducted experiments on the performance of
Source Detection and Retrieval of our system. Further experiments to test the full
functionality of our prototype are addressed in Chapter 5 as future work. For the
experiments on Source Detection and Retrieval, we compare the results of our prototype
(C-PCA-SOM) against a variety of systems, including the original tree-based retrieval
(PCA-SOM) in [2] and the traditional VSM model. As the candidate for latent semantic
compression, many previous studies have chosen the LSI model; in our study, we instead
provide the comparison with the PCA model. All comparative models are slightly
modified to use the same modified Porter stemmer as our model. For the two SOM-based
models, we only use their top SOM maps for document retrieval and temporarily ignore
the contribution of the second-layer maps. Section 4.2 provides the results of Source
Detection and Retrieval for Literal Plagiarism, and Section 4.3 for Paraphrased
Plagiarism. In addition, we carry out empirical tests to study the contribution of different
parameters to the accuracy and optimization of our system, such as the weights (w1, w2)
or the dimensions of the term- and concept-based PCA features. The details are reported
in Section 4.4.
4.1. Experiment Initialization
4.1.1. The Dataset and Workstation Configuration
We use the Webis Crowd Paraphrase Corpus 2011 (Webis-CPC-11) dataset to test our
system. This dataset formed part of PAN 2010, the international competition on
plagiarism detection, and can be downloaded at the following address:
http://www.webis.de/research/corpora/corpus-pan-pc-10. In detail, there are 7,859 candidate
documents in total making up the original corpus. The test set also contains 7,859
suspicious documents corresponding to those in the corpus, i.e. each document in the
corpus has exactly one plagiarized document in the test dataset. The test set consists of
3,792 documents of literal plagiarism (non-paraphrased) cases and 4,067 documents of
paraphrased plagiarism cases. For each type of plagiarism, we construct multiple pairs of
sub-corpus and sub-test-set with sizes of 50, 100, 200 and 400 randomly selected
documents to evaluate our system at different data scales. For each sub-corpus, we
perform the processes described in Stage 1 of the 3-stage prototype to organize its
documents, and later use the corresponding sub-test-set for evaluation.
The experiments, the computation of the PCA rotation matrices and the SOM clustering
are conducted on a PC with a 2.2 GHz Core 2 Duo CPU and 2 GB of RAM.
4.1.2. Performance Measures and Parameter Configuration
To provide comparable results between different models, we use Precision and Recall to
evaluate their performance. Precision and Recall are computed as follows:

Precision = No. of correctly retrieved documents / No. of retrieved documents   (6)

Recall = No. of correctly retrieved documents / No. of relevant documents in corpus   (7)
Since each original document has exactly one plagiarized document, we only consider
whether the first retrieved document is the correct candidate or not. Hence, we set the
scaling parameter ε = 0 in formula (3). At this stage, we assume that the contributions of
the term- and concept-based features are equivalent and set w1 = w2 = 0.5 in formula (2).
The empirical study of these parameters is outlined later in Section 4.4.
It is apparent that Precision and Recall are equal when using the Webis-CPC-11 dataset.
Therefore, we use PR to denote both of them, and we further add the "No. of correct
retrieval" measure to indicate the number of correctly detected documents for the
suspicious documents in the test sets.
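Formulas (6) and (7) can be sketched as a small helper over lists of document identifiers (an illustrative helper, not the evaluation script used in the thesis).

```python
def precision_recall(retrieved, relevant):
    """Eqs. (6) and (7): correct retrievals over retrieved / relevant counts."""
    correct = len(set(retrieved) & set(relevant))
    precision = correct / len(retrieved) if retrieved else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    return precision, recall
```

When exactly one document is retrieved per query and exactly one is relevant, as in the setting above, the two values coincide, which is why a single PR figure is reported.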
4.2. Source Detection and Retrieval for Literal Plagiarism
To begin with, we arbitrarily set the parameters for each sub-corpus as in Table 1. Our
model, the C-PCA-SOM, uses all of these parameters, while the PCA-SOM ignores the
concept-based Vocabulary and the concept-based PCA feature dimensions. The PCA
model only uses the term-based Vocabulary and the term-based PCA feature dimensions,
and the VSM model only uses the term-based Vocabulary to construct its term-document
matrix. In addition, we set w1 = w2 = 0.5, as mentioned above, to give the term- and
concept-based features the same amount of contribution.
Corpus size           50       100      200       400
V1T size              1500     2500     3500      5000
V1C size              1000     2000     2500      3500
T/C PCA dimensions    40/40    80/80    130/130   220/220
SOM size              6x8      7x8      8x8       8x9
SOM iterations        100      150      200       200
Table 1 - Configuration of Parameters for Literal Plagiarism
The results of the different models are reported in Table 2 and Fig. 11. The diagram
illustrates the PRs of the different systems in detecting source documents for Literal
Plagiarism cases. Our model produces results competitive with the other models for the
case of single-source detection. For the corpus sizes of 100 and 200, our system is even
slightly better than the PCA-SOM without the concept-based feature. For the corpus size
of 50, all systems produce the same result of 96%. It is observed that PCA and VSM
seem to perform better in single-source detection cases. Even though retrieval takes more
time for PCA and VSM, these models compare each query document with all documents
in the corpus, so the possibility of missing the real candidate is low. For SOM-based
models such as ours and the PCA-SOM, fast retrieval depends on the results of the earlier
clustering process. This is clarified in Section 4.4, where the change of any parameter can
affect the accuracy of document clustering and, consequently, document retrieval.
Corpus size   Algorithm    No. of correct retrieval   PR
50            C-PCA-SOM    48/50                      0.96
50            PCA-SOM      48/50                      0.96
50            PCA          48/50                      0.96
50            VSM          48/50                      0.96
100           C-PCA-SOM    91/100                     0.91
100           PCA-SOM      88/100                     0.88
100           PCA          91/100                     0.91
100           VSM          91/100                     0.91
200           C-PCA-SOM    182/200                    0.91
200           PCA-SOM      181/200                    0.905
200           PCA          185/200                    0.925
200           VSM          187/200                    0.935
400           C-PCA-SOM    355/400                    0.8875
400           PCA-SOM      355/400                    0.8875
400           PCA          359/400                    0.8975
400           VSM          361/400                    0.9025
Table 2 - Source Detection & Retrieval for Literal Plagiarism
Figure 11 - Performance of Source Detection & Retrieval for Literal Plagiarism
4.3. Source Detection and Retrieval for Paraphrased Plagiarism
In the same manner as the test for Literal Plagiarism, parameters are first set arbitrarily.
Table 3 shows the parameter configuration for each corpus. The results of Source
Detection and Retrieval for Paraphrased Plagiarism cases are reported in Table 4 and
Fig. 12. w1 and w2 are kept the same as in 4.2.
Surprisingly, in the case of Paraphrased Plagiarism, the PCA-SOM model performs better
than the C-PCA-SOM in detecting the corresponding candidates. In addition, since only
global information is involved in retrieval, the exhaustive VSM and PCA still produce
better results in finding the single source document for a suspicious document. This may
be because the overall topic of a paraphrased document remains the same as that of the
original document. Regarding the different performance of the two SOM-based models,
we further investigate the contribution of the concept-based feature to the performance of
clustering and subsequent retrieval. At this stage, it can be hypothesized that the
concept-based feature might introduce noise into the clustering process. To examine this
hypothesis, we try different values of the weights w1 and w2. The results are reported in
Section 4.4.
Corpus size           50       100      200       400
V1T size              1700     2700     3800      5500
V1C size              1100     2300     3000      4000
T/C PCA dimensions    45/45    90/90    140/140   240/240
SOM size              6x8      7x8      8x8       8x9
SOM iterations        100      150      200       200
Table 3 - Configuration of Parameters for Paraphrased Plagiarism
Corpus size   Algorithm    No. of correct retrieval   PR
50            C-PCA-SOM    44/50                      0.88
50            PCA-SOM      46/50                      0.92
50            PCA          43/50                      0.86
50            VSM          46/50                      0.92
100           C-PCA-SOM    83/100                     0.83
100           PCA-SOM      86/100                     0.86
100           PCA          90/100                     0.9
100           VSM          93/100                     0.93
200           C-PCA-SOM    160/200                    0.8
200           PCA-SOM      167/200                    0.835
200           PCA          174/200                    0.87
200           VSM          183/200                    0.915
400           C-PCA-SOM    274/400                    0.685
400           PCA-SOM      288/400                    0.72
400           PCA          310/400                    0.775
400           VSM          333/400                    0.8325
Table 4 - Source Detection & Retrieval for Paraphrased Plagiarism
Figure 12 - Performance of Source Detection & Retrieval for Paraphrased Plagiarism
4.4. Study of Parameters
This section provides a comprehensive empirical study of the effect of different
parameters on the performance of the C-PCA-SOM model, including: the sizes of the
term- and concept-based Vocabularies in Sections 4.4.1 and 4.4.2, the dimensions of the
term- and concept-based PCA features in Sections 4.4.3 and 4.4.4 and, lastly, the
contribution of the weighting parameters w1 and w2 in Section 4.4.5. The experiments are
carried out on a compound corpus of size 300 containing both literal and paraphrased
plagiarism cases; we randomly choose 150 documents of each type of plagiarism to
achieve more representative outcomes. SOM-based document clustering and retrieval are
the two processes affected by the change of any parameter and, hence, they are performed
again at each change. Note that performance can also differ slightly between runs with
the same set of parameters, as can be seen in the following sections.
4.4.1. Size of Term based Vocabulary V1T
After performing term extraction, word stemming, stop word removal and concept
construction, we obtain a full term-based Vocabulary of 9,868 distinct terms and a full
concept-based Vocabulary of 6,815 distinct concepts. In the experiment, we choose
different sizes of the term-based Vocabulary V1T to test its contribution to the
performance of our prototype. The other parameters are kept fixed as follows:
V1C size = 4000, T/C PCA dimensions = 200/200, SOM size = 8 x 9, SOM
training iterations = 150, w1 = w2 = 0.5.
Table 5 and Fig. 13 illustrate the performance of our system for different choices of the
V1T size. It can be seen that the size of V1T does not much affect the accuracy of Source
Detection and Retrieval: Precision/Recall fluctuates between 85.66% and 87.3%.
However, in this case an optimum configuration is achieved at a size of around 6000.
π‘½π‘»πŸ Size
3000
4000
5000
6000
7000
8000
No of correct
retrieval
261/300
259/300
257/300
262/300
259/300
261/300
PR
0.87
0.863
0.8566
0.873
0.863
0.87
Table 5 - Performance based on different sizes of Term-based Vocabulary
Figure 13 - Performance based on different sizes of Term based Vocabulary
4.4.2. Size of Concept based Vocabulary V1C
In this section, we study the influence of different sizes of the concept-based Vocabulary
V1C on the system performance. For the size of the term-based Vocabulary V1T, we
choose the optimum value from 4.4.1, V1T = 6000. While the size of V1C varies, the other
parameters are kept the same as in 4.4.1 (V1T size = 6000, T/C PCA dimensions =
200/200, SOM size = 8 x 9, SOM training iterations = 150, w1 = w2 = 0.5).
The results are documented in Table 6 and Fig. 14. Similarly, changing the V1C size does
not significantly change the accuracy of candidate retrieval. The highest PR value is
88%, corresponding to a V1C size of 6000 for the corpus size of 300.
𝑽π‘ͺ𝟏 Size
2000
3000
4000
5000
6000
No of correct
retrieval
259/300
260/300
254/300
259/300
264/300
PR
0.863
0.866
0.846
0.863
0.88
Table 6 - Performance based on different sizes of Concept based Vocabulary
Figure 14 - Performance based on different sizes of Concept based Vocabulary
4.4.3. Dimensions of Term based PCA feature
For the experiments on different dimensions of the term-based PCA feature, we set the
parameters as in Section 4.4.2, except for the concept-based Vocabulary V1C size; we
choose the V1C size of 6000, which provided the best PR in the previous section. The
involved parameters are summarized as follows (V1T size = 6000, V1C size = 6000,
Concept PCA dimensions = 200, SOM size = 8 x 9, SOM training iterations = 150,
w1 = w2 = 0.5).
In this study, we can see a clearer trend than in the studies in 4.4.1 and 4.4.2. It is
observed from the results (Table 7 and Fig. 15) that the change in the dimensions of the
term-based PCA feature can significantly influence the performance of the C-PCA-SOM
model. Specifically, PR increases from 83.6% to 88% while the dimensions rise from 50
to 200. However, PR drops sharply (by more than 60 percentage points) from 250
dimensions onward. Based on this study, it is clear that it is unnecessary to use all terms
to build the term-based Vocabulary, because doing so might introduce "noisy" features
that hurt system performance.
Term-based PCA dimensions   No. of correct retrieval   PR
50                          251/300                    0.836
100                         260/300                    0.866
150                         264/300                    0.88
200                         264/300                    0.88
250                         83/300                     0.276
300                         87/300                     0.29
Table 7 - Performance based on different dimensions of Term based PCA feature
Figure 15 - Performance based on different dimensions of Term based PCA feature
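The dimensionality reduction studied here follows the standard PCA recipe: project the term vectors onto their leading principal components and discard the low-variance directions. A minimal, illustrative sketch with random data standing in for the real term-frequency matrix (this is not the thesis implementation; the shapes mirror the 6000-term vocabulary and the 150-dimension setting):

```python
import numpy as np

# Illustrative only: random vectors stand in for the 6000-dimensional
# term-frequency features of 300 documents.
rng = np.random.default_rng(0)
docs = rng.random((300, 6000))          # 300 documents x 6000-term vocabulary

centered = docs - docs.mean(axis=0)     # PCA requires mean-centred data
_, s, vt = np.linalg.svd(centered, full_matrices=False)

k = 150                                 # dimensions kept, as chosen above
features = centered @ vt[:k].T          # reduced Term-based PCA feature
print(features.shape)                   # prints: (300, 150)
```

Components beyond the leading ones capture progressively less variance; retaining too many of them (e.g. 250 or 300 dimensions) admits exactly the kind of "noisy" directions that the sharp PR drop in Table 7 suggests.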
4.4.4. Dimensions of Concept-based PCA feature
The parameters for the experiments on different dimensions of the Concept-based PCA feature are set as follows (𝑉1𝑇 size = 6000, 𝑉1𝐶 size = 6000, Term PCA dimensions = 150, SOM size = 8 x 9, SOM training iterations = 150, 𝑤1 = 𝑤2 = 0.5). The only parameter modified is the dimensionality of the Term-based PCA feature, which is set to 150 (tied with 200 for the best retrieval result in Section 4.4.3).
Table 8 and Fig. 16 summarize the experiment outcomes for various Concept-based PCA feature dimensionalities. Unlike the Term-based PCA feature, increasing the dimensionality of the Concept-based PCA feature only slightly enhances or degrades the performance of the C-PCA-SOM system on Source Detection and Retrieval. The diagram shows that PR fluctuates between just over 80% and nearly 90%. The highest PR of 89% is achieved with Concept-based PCA dimensions of 250.
Concept-based PCA dimensions    No of correct retrieval    PR
50                              242/300                    0.806
100                             260/300                    0.866
150                             255/300                    0.85
200                             253/300                    0.843
250                             267/300                    0.89
300                             256/300                    0.853
Table 8 - Performance based on different dimensions of Concept-based PCA feature
[Figure: PR (70-100%) plotted against Dimensions of Concept-based PCA feature (50-300)]
Figure 16 - Performance based on different dimensions of Concept-based PCA feature
4.4.5. Contribution of the Weights 𝑤1 and 𝑤2
Finally, to investigate the contribution of the Term- and Concept-based features, we study different values of 𝑤1 and 𝑤2, which assign the "degree of significance" of Terms and Concepts. For the parameter configuration, we only modify the dimensionality of the Concept-based PCA feature, setting it to 250, which produces the best result in Section 4.4.4. The parameters are summarized as follows (𝑉1𝑇 size = 6000, 𝑉1𝐶 size = 6000, T/C PCA dimensions = 150/250, SOM size = 8 x 9, SOM iterations = 150).
Table 9 summarizes the Source Detection and Retrieval results. It can be seen that (𝑤1 = 1.0, 𝑤2 = 0.0) and (𝑤1 = 0.0, 𝑤2 = 1.0) are two special cases. The former is equivalent to the PCA-SOM model without the Concept-based feature; the latter uses only the Concept-based feature for similarity calculation, i.e. the Term-based feature is ignored. Using either feature type alone already achieves satisfactory PR values of 87.6% and 84%, respectively. However, combining the Term- and Concept-based features with an appropriate configuration of the weights 𝑤1 and 𝑤2 produces better results: for the corpus size of 300, (𝑤1 = 0.4, 𝑤2 = 0.6) gives the highest PR of 88.3%. This study shows that the Concept-based feature can be used to improve Document Representation, Document Clustering, Document Retrieval and, potentially, Plagiarism Detection.
In conclusion, even though VSM and PCA can provide better results than PCA-SOM and C-PCA-SOM, they are impractical for large datasets. The two SOM-based models can be applied to real-time DR and PD thanks to their constant processing time. In addition, C-PCA-SOM, with its additional Concept-based feature, can achieve better performance when the significance of the Term- and Concept-based features is appropriately balanced.
π’˜πŸ
(Term)
π’˜πŸ
(Concept)
1.0
0.8
0.6
0.4
0.2
0.0
0.0
0.2
0.4
0.6
0.8
1.0
No of correct
retrieval
263/300
262/300
261/300
265/300
254/300
252/300
PR
0.876
0.873
0.87
0.883
0.846
0.84
Table 9 - Performance based on different values of π’˜πŸ and π’˜πŸ
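The weighting scheme behind Table 9 can be sketched as a simple convex combination of the two per-candidate similarities (the function and variable names below are ours, not the thesis code):

```python
def combined_similarity(sim_term, sim_concept, w1=0.4, w2=0.6):
    """Blend Term- and Concept-based similarities; w1 + w2 should equal 1."""
    assert abs(w1 + w2 - 1.0) < 1e-9
    return w1 * sim_term + w2 * sim_concept

# (w1, w2) = (1.0, 0.0) reduces to plain PCA-SOM (Term-based feature only);
# (w1, w2) = (0.0, 1.0) ranks candidates by the Concept-based feature alone.
score = combined_similarity(0.9, 0.7)   # 0.4*0.9 + 0.6*0.7, i.e. about 0.78
```

The defaults shown match the best-performing configuration (𝑤1 = 0.4, 𝑤2 = 0.6) from Table 9.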
Chapter 5 – Conclusion and Future Work
This study introduces the concept-based tree structure representation and the C-PCA-SOM 3-stage prototype for Paraphrased PD. By exploiting the ontology WordNet to construct a concept-based feature that enhances document representation, the modified structure can capture not only syntactic but also semantic information of a document. As a result, Document Clustering can group more semantically related documents, and Document Retrieval can retrieve more relevant candidates on similar topics.
We conduct empirical experiments to test our system on Source Detection and Retrieval. The results prove that our prototype achieves competitive performance compared with other systems. Even though VSM and PCA are better for single-source detection and retrieval, they are impractical for large datasets; our prototype, on the other hand, can be applied to real-time applications. Furthermore, with appropriate parameter settings, the performance of the C-PCA-SOM model can be further improved.
In future work, we will focus on experiments on Stage 3 of the prototype to verify the contribution of the Concept-based feature to the task of Paraphrased Plagiarism Detection. In addition, we plan to study another feature, the Category-based feature, to further enhance the concept-modified tree representation. According to [26], the Category-based feature, extracted from another type of background knowledge, Wikipedia, can be used to improve Document Clustering. Part of our future work is therefore to investigate the Category-based feature and the application of Wikipedia for speeding up DC and DR. Eventually, the automatic configuration of parameters, such as the weights of the different feature types, for optimum performance also needs to be studied in detail.
References
[1] R. Lukashenko, V. Graudina, and J. Grundspenkis, "Computer-based plagiarism detection methods and tools: an overview," in Proceedings of the 2007 international conference on Computer systems and technologies, Bulgaria, 2007, pp. 1-6.
[2] T. W. S. Chow and M. K. M. Rahman, "Multilayer SOM With Tree-Structured Data for Efficient Document Retrieval and Plagiarism Detection," Neural Networks, IEEE Transactions on, vol. 20, pp. 1385-1402, Sept. 2009.
[3] S. M. Alzahrani, N. Salim, and A. Abraham, "Understanding Plagiarism Linguistic Patterns, Textual Features, and Detection Methods," Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, vol. 42, pp. 133-149, March 2012.
[4] S. Brin, J. Davis, and H. Garcia-Molina, "Copy detection mechanisms for digital documents," SIGMOD Rec., vol. 24, pp. 398-409, May 1995.
[5] N. Shivakumar and H. Garcia-Molina, "SCAM: A Copy Detection Mechanism for Digital Documents," in 2nd International Conference in Theory and Practice of Digital Libraries (DL 1995), Austin, Texas, 1995.
[6] C. Grozea, C. Gehl, and M. Popescu, "ENCOPLOT: Pairwise sequence matching in linear time applied to plagiarism detection," in Proc. SEPLN, Donostia, Spain, 2009, pp. 10-18.
[7] R. Yerra and Y.-K. Ng, "A Sentence-Based Copy Detection Approach for Web Documents," in Fuzzy Systems and Knowledge Discovery. vol. 3613, L. Wang and Y. Jin, Eds., ed: Springer Berlin / Heidelberg, 2005, pp. 481-482.
[8] J. Koberstein and Y.-K. Ng, "Using Word Clusters to Detect Similar Web Documents," in Knowledge Science, Engineering and Management. vol. 4092, J. Lang, F. Lin, and J. Wang, Eds., ed: Springer Berlin / Heidelberg, 2006, pp. 215-228.
[9] A. Schenker, M. Last, H. Bunke, and A. Kandel, "Classification of Web documents using a graph model," in Document Analysis and Recognition, 2003. Proceedings. Seventh International Conference on, 2003, pp. 240-244.
[10] T. W. S. Chow, H. Zhang, and M. K. M. Rahman, "A new document representation using term frequency and vectorized graph connectionists with application to document retrieval," Expert Systems with Applications, vol. 36, pp. 12023-12035, March 2009.
[11] M. K. M. Rahman and T. W. S. Chow, "Content-based hierarchical document organization using multi-layer hybrid network and tree-structured features," Expert Systems with Applications, vol. 37, pp. 2874-2881, Sept. 2010.
[12] H. Zhang and T. W. S. Chow, "A coarse-to-fine framework to efficiently thwart plagiarism," Pattern Recognition, vol. 44, pp. 471-487, 2011.
[13] H. Zhang and T. W. S. Chow, "A multi-level matching method with hybrid similarity for document retrieval," Expert Systems with Applications, vol. 39, pp. 2710-2719, Feb. 2012.
[14] S. Wold, K. Esbensen, and P. Geladi, "Principal component analysis," Chemometrics and Intelligent Laboratory Systems, vol. 2, pp. 37-52, 1987.
[15] T. Kohonen, "The self-organizing map," Proceedings of the IEEE, vol. 78, pp. 1464-1480, 1990.
[16] A. Si, H. V. Leong, and R. W. H. Lau, "CHECK: a document plagiarism detection system," in Proceedings of the 1997 ACM symposium on Applied computing, San Jose, California, United States, 1997, pp. 70-77.
[17] L. Sindhu, B. B. Thomas, and S. M. Idicula, "A Study of Plagiarism Detection Tools and Technologies," International Journal of Advanced Research In Technology, vol. 1, pp. 64-70, 2011.
[18] G. Salton, A. Wong, and C. S. Yang, "A vector space model for automatic indexing," Commun. ACM, vol. 18, pp. 613-620, 1975.
[19] J. Zobel and A. Moffat, "Exploring the similarity space," ACM SIGIR Forum, vol. 32, pp. 18-34, 1998.
[20] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by Latent Semantic Analysis," Journal of the American Society for Information Science, vol. 41, pp. 391-407, 1990.
[21] K. Lagus, "Text Retrieval Using Self-Organized Document Maps," Neural Processing Letters, vol. 15, pp. 21-29, 2002.
[22] S. Kaski, T. Honkela, K. Lagus, and T. Kohonen, "WEBSOM – Self-organizing maps of document collections," Neurocomputing, vol. 21, pp. 101-117, 1998.
[23] N. Ampazis and S. Perantonis, "LSISOM — A Latent Semantic Indexing Approach to Self-Organizing Maps of Document Collections," Neural Processing Letters, vol. 19, pp. 157-173, April 2004.
[24] A. Georgakis, C. Kotropoulos, A. Xafopoulos, and I. Pitas, "Marginal median SOM for document organization and retrieval," Neural Networks, vol. 17, pp. 365-377, 2004.
[25] X. Xue and Z. Zhou, "Distributional Features for Text Categorization," Knowledge and Data Engineering, IEEE Transactions on, vol. 21, pp. 428-442, March 2009.
[26] X. Hu, X. Zhang, C. Lu, E. K. Park, and X. Zhou, "Exploiting Wikipedia as external knowledge for document clustering," in Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, Paris, France, 2009, pp. 389-396.
[27] A. Hotho, S. Staab, and G. Stumme, "Ontologies improve text document clustering," in Data Mining, 2003. ICDM 2003. Third IEEE International Conference on, 2003, pp. 541-544.
[28] S. Liu, F. Liu, C. Yu, and W. Meng, "An effective approach to document retrieval via utilizing WordNet and recognizing phrases," in Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, Sheffield, United Kingdom, 2004, pp. 266-272.
[29] J. Sedding and D. Kazakov, "WordNet-based text document clustering," in Proceedings of the 3rd Workshop on RObust Methods in Analysis of Natural Language Data, Geneva, 2004, pp. 104-113.
[30] G. Varelas, E. Voutsakis, P. Raftopoulou, E. G. M. Petrakis, and E. E. Milios, "Semantic similarity methods in WordNet and their application to information retrieval on the web," in Proceedings of the 7th annual ACM international workshop on Web information and data management, Bremen, Germany, 2005, pp. 10-16.
[31] G. Spanakis, G. Siolas, and A. Stafylopatis, "Exploiting Wikipedia Knowledge for Conceptual Hierarchical Clustering of Documents," The Computer Journal, vol. 55, pp. 299-312, March 2012.
[32] S. Meyer zu Eissen, B. Stein, and M. Kulig, "Plagiarism detection without reference collections," in Advances in data analysis, R. Decker and H.-J. Lenz, Eds., ed Berlin, Heidelberg: Springer, 2007, pp. 359-366.
[33] S. Meyer zu Eissen and B. Stein, "Intrinsic Plagiarism Detection," in Advances in Information Retrieval. vol. 3936, M. Lalmas, A. MacFarlane, S. Rüger, A. Tombros, T. Tsikrika, and A. Yavlinsky, Eds., ed: Springer Berlin / Heidelberg, 2006, pp. 565-569.
[34] N. Shivakumar and H. Garcia-Molina, "Building a scalable and accurate copy detection mechanism," in Proceedings of the first ACM international conference on Digital libraries, Bethesda, Maryland, United States, 1996, pp. 160-168.
[35] A. Barrón-Cedeño and P. Rosso, "On automatic plagiarism detection based on n-grams comparison," in Proc. 31st Eur. Conf. IR Res. Adv. Info. Retrieval, 2009, pp. 696-700.
[36] E. Stamatatos, "Plagiarism detection using stopword n-grams," Journal of the American Society for Information Science and Technology, vol. 62, pp. 2512-2527, Sept. 2011.
[37] G. A. Miller, "WordNet: a lexical database for English," Commun. ACM, vol. 38, pp. 39-41, 1995.