Development of iCU:
A Plagiarism Detection Software
Aderemi A. Atayero, Adeyemi A. Alatishe and Kanyinsola O. Sanusi
Department of Electrical & Information Engineering
Covenant University, Ota, Nigeria
atayero@ieee.org

Abstract— In this paper, we present the design, development and deployment of iCU, a plagiarism detection software system. iCU is designed for the detection of plagiarized works and comprises three main modules with different functions. The first, the file comparison module, is intended for use when there is a contention between two documents that have exactly or nearly the same content. The second, the similarity detection module, works like a search engine and is used when an individual is almost certain of the source of a text. The third, the global comparison module, checks the extent to which a given piece of work is similar in content to already published works. The developed software works both online and offline and is easily deployable in an academic environment.
Keywords – Plagiarism; plagiarism detection; similarity check; copy detection; Turnitin; MOSS; JPlag.
I. INTRODUCTION
Plagiarism is essentially the attempt to claim credit for the intellectual property of another person. Such intellectual property may be written ideas, texts, programs, etc., or even unwritten ideas. It is considered a very serious offence, especially in academia, since it compromises academic integrity. The line between plagiarism and research is becoming thinner with the information deluge of the 21st century. The effort in academia to prevent the rampant occurrence of plagiarism is therefore a matter of expediency. The common causes reported in the literature for its ever-increasing occurrence are: a) insufficient time to prepare assignments and homework, which could be a consequence of excessive workload in school curricula; b) lack of time to understand the subject matter on the part of the researcher, which could result from various factors; c) giving students assignments that are outside their areas of competence or for which they have not been adequately prepared; and d) laziness on the part of the researcher to actually read, comprehend and represent source ideas in a manner that does not border on plagiarism.
Suffice it to note here that the act of building on previous works is not in itself wrong; it is the lack of (or insufficient) attribution that is frowned upon. As the motto of the popular search engine Google Scholar rightly says, it is necessary to "stand on the shoulders of giants"1 in order to advance knowledge. But the giants must be adequately referenced and their previous works attributed. There is nothing wrong in accessing information freely, but the technology should also protect each document from being illegally copied by a lazy person. One approach that could be used to address this issue is to provide a copy detection system with which legitimate original documents can be registered and copies detected [1].
1 A paraphrase of Sir Isaac Newton's popular quote: "If I have seen further it is only by standing on the shoulders of giants."
Aderemi A. Atayero, Adeyemi A. Alatishe and Kanyinsola O. Sanusi are with the Department of Electrical & Information Engineering, Covenant University, PMB 1023, Ota, Nigeria (phone: +234.807.886.6304; e-mail: atayero@ieee.org).
II. PREVIOUS WORKS
Most plagiarism detection systems are either universal (that is, able to process text documents of any nature) or specially designed to detect plagiarism in a particular context, e.g. in source code files. The methodology adopted in the development of iCU is limited to textual search. Enterprise-level plagiarism checking software (such as Turnitin) would normally compare input documents against very large databases. The iCU software uses the freely accessible Google search engine database for its comparison check interface. The implication of this is that it will not return as many results as a well-known plagiarism detector like Turnitin would. In most cases, plagiarism is intentional and deliberately harmful in nature. On the other hand, it may also be the result of ignorance, which could be avoided if a researcher had a better understanding of the nature of plagiarism. Plagiarism is to academia what piracy is to the entertainment industry. In the fight against plagiarism, it can either be proactively prevented, by educating students and researchers alike on its consequences, or detected after the fact. There are two methods of plagiarism detection: manual detection and computer-aided detection. Since detecting plagiarism manually can be hard even for a skilled teacher, this work focuses on the latter by examining already-existing tools for detecting plagiarism. There are numerous plagiarism detection systems; however, not all of them implement completely new methods and algorithms.
A. Types of Plagiarism
There are different types of plagiarism, which include, but
are not limited to the following:
1. Unauthorized or unacknowledged collaborative work.
This is when the same or similar phrases, quotations, sentences or parallel constructions appear in two or more papers on the same topic. This can be avoided by acknowledging in a footnote or endnote any significant discussions, advice, comments or suggestions the student might have received from or had with others.
2. Attempting to pass off a whole or part of a work belonging to another person or group as your own work.
This includes borrowing, copying, buying, receiving, downloading, taking, using and even stealing a paper that is not your own. Any text taken directly from another source, including course textbooks, must be quoted using quotation marks, and such material must be cited.
3. Usage of improperly paraphrased text without referencing.
According to the Harbrace College Handbook, a paraphrase is a restatement of a source in about the same number of words. To paraphrase correctly, the wording must be distinctly different; that is, the paraphrase must restate the author's original idea in the writer's own words [2]. Changing the word order or sentence structure, deleting words or phrases, or substituting synonyms is not enough if the original author's wording essentially remains unchanged in the text. If a text is difficult to paraphrase, it can instead be quoted and cited. Improper paraphrasing often results from the use of a single source; in such cases it is difficult to separate one's own ideas from those of the author or translator.
4. The use of any amount of text that is improperly paraphrased, but which is either not cited or which is improperly cited.
The following should be documented: a) direct quotations, b) paraphrased views of others, c) uncommon knowledge, d) information that is not commonly accepted, and e) tables, charts, graphs or statistics taken directly or paraphrased from a source.
Common Knowledge
As the name implies, common knowledge means anything that is generally known to everyone. An example of common knowledge is historical information such as the Independence Day of a particular country; it could also be a well-known saying. If unsure whether a piece of information is common knowledge, it should rather be cited. For, as the saying goes, better safe than sorry (which, incidentally, is itself a good example of common knowledge).
B. Classification Of Plagiarism Detection Systems
Basically, there is no single criterion for performing
classification. Most hermetic systems are either universal (that
is, can process text documents of any nature) or specially
designed to detect plagiarism in source code files. The figure
below shows the classification of plagiarism detection systems
based on algorithm.
Fingerprint-Based Systems
The main idea of fingerprinting is to create fingerprints for all documents in a collection. A fingerprint is a short sequence of bytes that characterizes a longer file. For instance, fingerprints can be obtained by applying a hash function to a file. In plagiarism detection systems, fingerprints are more advanced than simple hash codes. Nowadays, it is generally believed that attribute counting is inferior to content comparison, since even small modifications can greatly affect fingerprints. As a result, later systems usually do not follow this technique alone, but there are several recent systems that combine fingerprinting with elements of string matching, for example the MOSS program.
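As a minimal illustration of the fingerprinting idea (not the scheme used by any particular system), the C# sketch below normalizes a text and hashes it to a short fingerprint; texts that differ only in whitespace or case receive the same fingerprint, while any other edit changes it completely.

```csharp
// Minimal fingerprinting sketch: normalize the text, hash it with SHA-256,
// and keep a short hex prefix of the digest as the document's fingerprint.
using System;
using System.Security.Cryptography;
using System.Text;

static class Fingerprint
{
    // Collapse whitespace and case so trivial reformatting does not change the fingerprint.
    static string Normalize(string text) =>
        string.Join(" ", text.ToLowerInvariant()
                             .Split((char[])null, StringSplitOptions.RemoveEmptyEntries));

    public static string Of(string text)
    {
        using var sha = SHA256.Create();
        byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(Normalize(text)));
        return BitConverter.ToString(digest, 0, 8).Replace("-", ""); // short prefix is enough here
    }

    static void Main()
    {
        Console.WriteLine(Of("Plagiarism is the attempt to claim credit for another's work."));
        Console.WriteLine(Of("plagiarism is  the attempt to claim credit for another's work.")); // same fingerprint
        Console.WriteLine(Of("Plagiarism is an attempt to claim credit for another's work."));   // different fingerprint
    }
}
```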
Content Comparison Techniques
These are the building blocks of the majority of present plagiarism detection systems. There are different algorithms aimed at file-to-file comparison, varying in terms of speed, memory requirements and expected reliability.
String Matching Based Content Comparison
String-matching-based methods compare files by treating them as plain strings. They usually do not take into account the hierarchical structure of a document or computer program, considering it as raw data.
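A concrete, if simplistic, example of such a comparison is to measure the longest common subsequence (LCS) of the two files' word sequences; the sketch below is purely illustrative and does not reflect any particular tool's algorithm.

```csharp
// String-matching comparison sketch: treat both files as word sequences and
// score them by the length of their longest common subsequence (LCS).
using System;

static class StringMatchCompare
{
    static int LcsLength(string[] a, string[] b)
    {
        var dp = new int[a.Length + 1, b.Length + 1];
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
                dp[i, j] = a[i - 1] == b[j - 1]
                    ? dp[i - 1, j - 1] + 1
                    : Math.Max(dp[i - 1, j], dp[i, j - 1]);
        return dp[a.Length, b.Length];
    }

    // Similarity in [0, 1]: shared subsequence length relative to the longer text.
    public static double Similarity(string textA, string textB)
    {
        string[] a = textA.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        string[] b = textB.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        int denom = Math.Max(a.Length, b.Length);
        return denom == 0 ? 1.0 : (double)LcsLength(a, b) / denom;
    }

    static void Main() =>
        Console.WriteLine(Similarity("the quick brown fox jumps", "the slow brown fox jumps over"));
}
```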
Parse Trees Comparison
A parse tree is an ordered or rooted tree that represents
the syntactic structure of a string according to some formal
grammar [3]. Parse trees may be generated for sentences in
natural language as well as during processing of computer
languages, such as programming languages. Natural language
texts are divided into sections, subsections, paragraphs and
sentences, while source code files contain classes, functions,
logic blocks and control structures. Though this approach
seems to be the most advanced, little research in this area has
been carried out so far. For example, it is still unknown how
such a complex analysis of input files influences the final
results; in particular, it has not yet been established whether parse-tree analysis yields noticeably better detection results than simpler string-based methods.
One approach to addressing the issue of plagiarism is to provide copy detection systems with which legitimate original documents are registered and against which copies are detected. Some of the already-existing plagiarism detection tools include the following: iCheck, JPlag, MOSS and Turnitin.
A. iCheck Plagiarism Detection Tool
Algorithm: iCheck uses an approximate phrase matching algorithm. Phrase matching requires close proximity to detect any similarity in the headers and paragraphs of a submitted document. A header is a line of text which serves to indicate what the passage below it is about. The algorithm allows pattern matching in order to search for proximity results: it finds all occurrences of the pattern in sentences with at most k differences, where each character of the pattern may correspond to a differing character of the text. The algorithm ignores all errors encountered while 'scanning' the document or text [4].
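The following C# sketch is not the iCheck implementation; it merely illustrates the general idea of matching with at most k differences by computing the edit distance between a phrase and a candidate sentence.

```csharp
// Generic "at most k differences" matching via Levenshtein edit distance.
using System;

static class ApproximateMatch
{
    static int EditDistance(string s, string t)
    {
        var dp = new int[s.Length + 1, t.Length + 1];
        for (int i = 0; i <= s.Length; i++) dp[i, 0] = i;
        for (int j = 0; j <= t.Length; j++) dp[0, j] = j;
        for (int i = 1; i <= s.Length; i++)
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;          // substitution cost
                dp[i, j] = Math.Min(Math.Min(dp[i - 1, j] + 1,    // deletion
                                             dp[i, j - 1] + 1),   // insertion
                                    dp[i - 1, j - 1] + cost);
            }
        return dp[s.Length, t.Length];
    }

    // A phrase "matches" a sentence if no more than k single-character edits separate them.
    public static bool MatchesWithin(string phrase, string sentence, int k) =>
        EditDistance(phrase.ToLowerInvariant(), sentence.ToLowerInvariant()) <= k;

    static void Main()
    {
        // Two character substitutions: a match for k = 2 but not for k = 1.
        Console.WriteLine(MatchesWithin("plagiarism detection", "plagiarizm detektion", 2)); // True
        Console.WriteLine(MatchesWithin("plagiarism detection", "plagiarizm detektion", 1)); // False
    }
}
```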
Limitation: The system works like a similarity checker because the user has to copy and paste sentences into a textbox; it does not
perform global comparison. It should be noted that iCheck deals with text documents rather than source code.
B. JPlag Plagiarism Detection Tool
JPlag is a web service that finds pairs of similar programs among a given set of programs. It is written in Java and analyzes program source text written in Java, Scheme, C or C++.
Algorithm: JPlag uses the Greedy String Tiling algorithm. It operates in two phases:
a) All programs to be compared are parsed (or scanned, depending on the input language) and converted into token strings.
b) The token strings are compared in pairs to determine the similarity of each pair. The method used is essentially Greedy String Tiling: the two strings are searched for the biggest contiguous matches, every token is used in at most one match, and all matches of maximal length are marked [5].
During each comparison, JPlag attempts to cover one token string with substrings (tiles) taken from the other as well as possible. The percentage of the token strings that can be covered is the similarity value. The corresponding tiles are visualized in the HTML pages. A parser, in this context, is a computer program that divides code up into its functional components.
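As a rough illustration of the idea (JPlag itself tokenizes source code and uses the optimized RKR-GST variant), the following C# sketch applies a naive greedy string tiling to word tokens and reports the proportion of tokens covered by tiles.

```csharp
// Naive greedy string tiling over word tokens (illustrative only).
using System;

static class GreedyStringTiling
{
    public static double Similarity(string textA, string textB, int minMatch = 3)
    {
        string[] a = textA.ToLowerInvariant().Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        string[] b = textB.ToLowerInvariant().Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        bool[] markedA = new bool[a.Length], markedB = new bool[b.Length];
        int covered = 0;

        while (true)
        {
            int bestLen = 0, bestI = -1, bestJ = -1;
            // Find the longest contiguous run of still-unmarked matching tokens.
            for (int i = 0; i < a.Length; i++)
                for (int j = 0; j < b.Length; j++)
                {
                    int len = 0;
                    while (i + len < a.Length && j + len < b.Length &&
                           !markedA[i + len] && !markedB[j + len] &&
                           a[i + len] == b[j + len])
                        len++;
                    if (len > bestLen) { bestLen = len; bestI = i; bestJ = j; }
                }
            if (bestLen < minMatch) break;           // no sufficiently long tile is left
            for (int t = 0; t < bestLen; t++)        // mark the tile in both sequences
            {
                markedA[bestI + t] = true;
                markedB[bestJ + t] = true;
            }
            covered += bestLen;
        }
        // Similarity: proportion of tokens covered by tiles, averaged over both texts.
        return a.Length + b.Length == 0 ? 0.0 : 2.0 * covered / (a.Length + b.Length);
    }

    static void Main() =>
        Console.WriteLine(Similarity(
            "the programs are converted into token strings before comparison",
            "before comparison the programs are converted into token strings"));
}
```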
Limitation: The JPlag plagiarism detection tool can only be used for source code and not for text documents.
C. MOSS
MOSS (Measure of Software Similarity) is a widely used plagiarism detection service that has been available over the Internet since 1997. It is a free online copy detection tool, primarily used for detecting plagiarism in programming assignments in computer science and other engineering courses, though several text formats are supported as well.
Algorithm: MOSS is based on a string-matching algorithm that works by dividing programs into k-grams, where a k-gram is a contiguous substring of length k ('contiguous' meaning connected, without gaps). Each k-gram is hashed, and MOSS selects a subset of these hash values as the program's fingerprints. Similarity is determined by the number of fingerprints shared by the programs: the more fingerprints they share, the more similar they are [6].
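The sketch below captures the spirit of this approach, with the caveat that real MOSS selects fingerprints by winnowing rather than by the simple modulus rule used here, and that .NET string hashes are randomized per process, so a real system would use a stable hash it can store.

```csharp
// k-gram fingerprinting sketch: hash every k-character substring, keep a
// deterministic subset of the hashes, and compare fingerprint sets.
using System;
using System.Collections.Generic;
using System.Linq;

static class KGramFingerprints
{
    public static HashSet<int> Of(string text, int k = 5, int keepMod = 4)
    {
        // Strip everything except letters and digits so formatting does not matter.
        string s = new string(text.ToLowerInvariant().Where(char.IsLetterOrDigit).ToArray());
        var prints = new HashSet<int>();
        for (int i = 0; i + k <= s.Length; i++)
        {
            int h = s.Substring(i, k).GetHashCode();  // per-process hash; fine within one run
            if ((h & 0x7fffffff) % keepMod == 0)      // keep roughly one in keepMod hashes as fingerprints
                prints.Add(h);
        }
        return prints;
    }

    // Overlap of the two fingerprint sets as a fraction of their union.
    public static double Similarity(string a, string b)
    {
        HashSet<int> fa = Of(a), fb = Of(b);
        if (fa.Count == 0 || fb.Count == 0) return 0.0;
        return (double)fa.Intersect(fb).Count() / fa.Union(fb).Count();
    }

    static void Main() =>
        Console.WriteLine(Similarity(
            "Each k-gram is hashed and a subset of the hash values is selected.",
            "Each k-gram is hashed; a subset of those hash values is then selected."));
}
```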
D. Turnitin
Turnitin operates in three modules: originality check, peer review and grade mark, and it presents its result in the form of an originality report. For the originality check, Turnitin accepts students' papers, checks them for originality and brings out the result. Peer review is a feature of the Turnitin service by which the instructor can create assignments that incorporate a collaborative learning environment through which students are encouraged to evaluate the writing decisions made by others [7]. Grade mark is a paperless grading system that, aside from providing increased efficiency and flexibility, enables the instructor to identify problems that might be common to the entire class, allowing future lessons to be tailored toward the elimination of such problems.
Algorithm: In the Turnitin algorithm, detection of an identical string of words does not in itself mean that the work is plagiarized; it may be that the string of words has been correctly referenced or cited. On the other hand, a string of words or a section of an assignment that is identical to a document in the database and is not referenced or cited correctly may be examined further by the professor or instructor for evidence of plagiarism.
Limitation: Turnitin can only detect the most blatantly copied text. It cannot detect cleverly paraphrased passages, or copied text that has been greatly altered by the student's use of a thesaurus.
Conclusion: Plagiarism checkers usually work along the same scheme: they compare the given document to the ones available online and attempt to match parts of the document against the general base of available texts. Each plagiarism checker has its own text comparison algorithm, and so different checkers can detect plagiarism at different levels. Plagiarism detection could be based on a proprietary database, the Internet, search engines, etc.
In developing software to detect plagiarism, the following need to be taken into consideration:
• Time of analysis: the speed with which a system performs the required check and presents the results. This factor must be analyzed in relation to the other ones, as a quick but inefficient check is useless.
• Document capacity: how many documents/words the plagiarism detection system can process per unit of time.
• Detection intensity: defines how often and what volume of text is checked in search engines within the detection time scope (for example, every ten words are checked, or only paragraphs).
• Comparison algorithm: the methodology employed for comparison; the results are shown in the plagiarism report.
• Precision: the main characteristic assessing the accuracy of the plagiarism check, i.e. the proportion of flagged documents that are actually plagiarized. The more documents are correctly flagged as plagiarized, the better the system's quality.
III. ICU DESIGN METHODOLOGY
A. Design Approach
The waterfall model was used as the development model for iCU, while the bottom-up approach was the chosen design approach. Bottom-up refers to a style of design in which an application is constructed starting from existing individual components of the application, gradually building more and more complicated features until the whole application has been written or programmed [3].
B. Use Case Diagram
Figure 1: The Use Case Diagram of the Software
C. Use Case Narratives
Figure 1 shows the use case diagram of iCU. The software was implemented in three sub-modules, namely: i) global comparison, ii) similarity detection and iii) file comparison. The use case narratives for each of the three sub-modules are presented in Table 1 in the appendix.
D. Coding Implementation
There are three different sub-modules of this software:
1. The file comparison module.
2. The similarity detection module.
3. The global comparison module.
The Integrated Development Environment (IDE) used was Visual Studio. The same programming language was used across the three sub-modules: the main programming language was C#, with ASP.NET and HTML also used along the way.
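Although the internal class structure of iCU is not reproduced here, the three sub-modules can be pictured as sharing a common interface; the C# sketch below is a hypothetical illustration of such a layout (all names are invented, not taken from the iCU source).

```csharp
// Hypothetical sketch of three sub-modules behind one interface (names invented).
using System;

public interface IPlagiarismModule
{
    string Name { get; }
    string Run(string[] inputs);   // file paths or pasted text, depending on the module
}

public sealed class FileComparisonModule : IPlagiarismModule
{
    public string Name => "File comparison";
    public string Run(string[] inputs) => $"Comparing {inputs[0]} with {inputs[1]} line by line...";
}

public sealed class SimilarityDetectionModule : IPlagiarismModule
{
    public string Name => "Similarity detection";
    public string Run(string[] inputs) => $"Searching the database for: \"{inputs[0]}\"";
}

public sealed class GlobalComparisonModule : IPlagiarismModule
{
    public string Name => "Global comparison";
    public string Run(string[] inputs) => $"Checking {inputs[0]} against already published sources...";
}

public static class ModuleDemo
{
    public static void Main()
    {
        IPlagiarismModule module = new FileComparisonModule();
        Console.WriteLine($"{module.Name}: {module.Run(new[] { "draft.txt", "thesis.txt" })}");
    }
}
```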
IV. ICU SOFTWARE INTERFACE
A. File Comparison Interface
Figure 5 shows the file comparison interface. Here, the user selects two files stored on the system and clicks the button 'Check Duplicate Content'. On clicking this, the software compares the two files line by line and presents the result as three statements: the number of lines that are similar in both files, the number of lines that are not the same in both files, and the number of lines that are in the first selected file but not in the second.
Figure 5: iCU file comparison interface
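As an illustration of the behaviour just described (this is not the iCU source, and the exact counting rules for repeated lines are an assumption), a line-by-line comparison producing the three reported figures could look as follows.

```csharp
// Line-by-line comparison sketch producing the three counts described above.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class FileCompare
{
    public static void Report(string pathA, string pathB)
    {
        List<string> linesA = File.ReadAllLines(pathA).Select(l => l.Trim()).ToList();
        List<string> linesB = File.ReadAllLines(pathB).Select(l => l.Trim()).ToList();
        var setA = new HashSet<string>(linesA);
        var setB = new HashSet<string>(linesB);

        int shared    = linesA.Count(l => setB.Contains(l));   // lines of the first file also found in the second
        int onlyInA   = linesA.Count - shared;                 // lines of the first file missing from the second
        int onlyInB   = linesB.Count(l => !setA.Contains(l));  // lines of the second file missing from the first
        int differing = onlyInA + onlyInB;                     // lines not common to both files

        Console.WriteLine($"Lines found in both files:        {shared}");
        Console.WriteLine($"Lines not the same in both files: {differing}");
        Console.WriteLine($"Lines only in the first file:     {onlyInA}");
    }

    static void Main(string[] args) => Report(args[0], args[1]);
}
```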
B. Similarity Detection Interface
A screenshot of this interface is shown in Figure 6. In this sub-module of the software, the user opens a file, selects some text from the document and pastes it into the text box shown. Once the system is connected to the Internet and the user clicks the button 'Check Duplicate Content', the software connects to the integrated database and generates results. The function of the box labeled 'Add quotes' is to make the software search the database for exactly the same text as the one pasted in the textbox.
Figure 6: iCU similarity detection interface
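The effect of the 'Add quotes' option can be pictured as wrapping the pasted text in quotation marks so that the lookup becomes an exact-phrase search. The sketch below builds a generic web search URL purely for illustration; it is not the database interface actually used by iCU.

```csharp
// Building an exact-phrase query: quoting the pasted text narrows the search
// to sources containing exactly that wording.
using System;

static class SimilarityQuery
{
    public static string BuildSearchUrl(string pastedText, bool addQuotes)
    {
        string query = addQuotes ? "\"" + pastedText + "\"" : pastedText;
        return "https://www.google.com/search?q=" + Uri.EscapeDataString(query);
    }

    static void Main()
    {
        Console.WriteLine(BuildSearchUrl("stand on the shoulders of giants", addQuotes: true));
        Console.WriteLine(BuildSearchUrl("stand on the shoulders of giants", addQuotes: false));
    }
}
```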
C. Global Comparison Interface
A screenshot of this interface is shown in Figure 7. The user selects a file and clicks on the button 'Compare Globally'. The result comes out showing the related websites and the percentage of plagiarism.
Figure 7: iCU global comparison interface
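One way such a percentage could be derived (a sketch only; the actual iCU computation is not described in more detail here) is to split the document into fixed-length phrases, look each phrase up online, and report the share of phrases that were found.

```csharp
// Sketch: split the document into phrases and report the share found online.
using System;
using System.Linq;

static class GlobalComparison
{
    // Placeholder for the real lookup (e.g. an exact-phrase web search per phrase).
    static bool FoundOnline(string phrase) => phrase.Contains("giants");

    public static double PlagiarismPercentage(string document, int wordsPerPhrase = 5)
    {
        string[] words = document.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        var phrases = Enumerable.Range(0, Math.Max(1, words.Length / wordsPerPhrase))
            .Select(i => string.Join(" ", words.Skip(i * wordsPerPhrase).Take(wordsPerPhrase)))
            .Where(p => p.Length > 0)
            .ToList();
        if (phrases.Count == 0) return 0.0;
        int found = phrases.Count(FoundOnline);   // phrases located in some online source
        return 100.0 * found / phrases.Count;     // reported as a percentage
    }

    static void Main() =>
        Console.WriteLine(PlagiarismPercentage(
            "It is necessary to stand on the shoulders of giants in order to advance knowledge."));
}
```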
A. Software Testing
The developed software was tested using standard software testing procedures. This process validates that a software application meets the business and technical requirements that guided its design and development, and works as expected [8]. It focuses on finding defects in the final product. The following tests were carried out on iCU: 1) unit test, 2) system test, 3) integration test, 4) user acceptance test, and 5) regression test.
Unit Test: This tests if individual units of code are fit for use. It
entails individual testing of each unit of code for functionality
before their integration. It helps to detect coding problems
early in the development cycle.
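For illustration, a unit test of the kind described might look like the xUnit example below; TextSimilarity is a hypothetical helper written for this sketch, not part of the published iCU code.

```csharp
// Illustrative xUnit tests against a hypothetical text-similarity helper.
using System.Collections.Generic;
using Xunit;

public static class TextSimilarity
{
    // Toy similarity: fraction of words of the first text that also occur in the second.
    public static double Similarity(string a, string b)
    {
        string[] wordsA = a.ToLowerInvariant().Split(' ');
        var wordsB = new HashSet<string>(b.ToLowerInvariant().Split(' '));
        int hits = 0;
        foreach (string w in wordsA)
            if (wordsB.Contains(w)) hits++;
        return (double)hits / wordsA.Length;
    }
}

public class TextSimilarityTests
{
    [Fact]
    public void IdenticalTextsScoreOne() =>
        Assert.Equal(1.0, TextSimilarity.Similarity("copy detection", "copy detection"));

    [Fact]
    public void UnrelatedTextsScoreZero() =>
        Assert.Equal(0.0, TextSimilarity.Similarity("copy detection", "waterfall model"));
}
```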
System Test: This test validates and verifies the functional
design specification.
Integration Test: Ensures that the integrated sub-modules work
together properly.
User Acceptance Test: UAT is a test run for a customer to demonstrate functionality [5]. This test was carried out by a user, who was to a large extent satisfied with the software.
Regression Test: Regression testing means rerunning test cases from existing test suites to build confidence that software changes have no unintended side effects. After work on the software was finished, some changes were made; this test was therefore carried out, and the software passed.
V. CONCLUSIONS
This paper has presented the development of iCU, a software system for the detection of plagiarism in written text files. The developed software was implemented in C# using a modular approach. It comprises three sub-modules, each performing a different plagiarism check function. The developed iCU software passed all of the tests mentioned above.
REFERENCES
[1] R. Olt, "A new design on plagiarism: developing an instructional design model to deter plagiarism in online courses," unpublished.
[2] Harbrace College Handbook, available at http://bit.ly/jiJcsS, accessed 2011.06.26, p. 255.
[3] J. Rumbaugh et al., Object-Oriented Modeling and Design, Prentice Hall, 1991.
[4] R. Baeza-Yates and G. H. Gonnet, "A new approach to text searching," Communications of the ACM, vol. 35, pp. 74-82.
[5] Berghel and Sallach, "Measurements of program similarity in identical task environments," pp. 65-76, 1984.
[6] A. Si, H. V. Leong and R. W. H. Lau, "A document plagiarism detection system," in Proceedings of the 1997 ACM Symposium on Applied Computing, pp. 70-77, 1997.
[7] G. Brumel, "Physicist found guilty of misconduct," pp. 419-421, Sept. 2002.
[8] T. Reenskaug, Working with Objects: The OOram Software Engineering Method, Manning, 1995.
APPENDIX
TABLE 1. USE CASE NARRATIVE FOR ICU SUB-MODULES
Global Comparison
Brief Description: This use case describes how the user operates this part of the project.
Actor: Lecturer.
Flow of Events (Basic Flow): This use case begins when the actor/user decides to use this section of the project. 1. The actor clicks on the global comparer button. 2. The page appears. 3. The actor is required to choose/select a file. 4. Once the actor selects a file and clicks the button 'compare files', the system connects to an online server and compares the file with other files in the database of the server it has connected to.
Special Requirements: Internet and web browser.
Pre-conditions: None.
Post-condition: The result brings out a collection of websites and documents that are very similar to the file the actor has uploaded.
Extension Points: None.

Similarity Detection
Brief Description: This use case describes how the actor can use the project as a search engine.
Actor: Lecturer.
Flow of Events (Basic Flow): 1. The actor clicks the button labeled 'compare files online'. 2. The actor opens a file and selects some part of it. 3. The actor copies the text into the textbox shown and clicks 'compare files'. 4. The system, just like the global comparer, connects to an online server and searches for documents that are exactly the same as or closely related to the selected text.
Special Requirements: Internet and browser.
Pre-conditions: None.
Post-condition: After the actor uploads the selected text, the result brings as output websites that have exactly the same or closely related text to the one selected from the file.
Extension Points: None.

File Comparison
Brief Description: The actor uses this section to ascertain a case of plagiarism where there is an already-existing contention.
Actor: Lecturer.
Flow of Events (Basic Flow): 1. The actor clicks the button and the page appears. 2. The actor selects the two files to be compared. 3. After this, the actor clicks the button 'compare files'.
Special Requirements: None.
Pre-conditions: There should be already-saved documents on the system.
Post-condition: The result brings out the percentage of similarity between the two files.
Extension Points: None.