International Journal of Electrical & Computer Sciences IJECS-IJENS Vol. 11, No. 04

Development of iCU: A Plagiarism Detection Software

Aderemi A. Atayero, Adeyemi A. Alatishe and Kanyinsola O. Sanusi
Department of Electrical & Information Engineering, Covenant University, PMB 1023, Ota, Nigeria
atayero@ieee.org

Abstract: In this paper, we present the design, development and deployment of iCU, a plagiarism detection software system. iCU comprises three main modules with distinct functions. The first, the file comparison module, is designed for use when there is a contention about two documents that have exactly or nearly the same content. The second, the similarity detection module, works like a search engine and is used when an individual is almost certain of the source of a piece of text. The third, the global comparison module, checks the extent to which a given piece of work is similar in content to already published works. The developed software works well both online and offline and is easily deployable in an academic environment.

Keywords: Plagiarism; plagiarism detection; similarity check; copy detection; Turnitin; MOSS; JPlag.
I. INTRODUCTION

Plagiarism is essentially the attempt to claim credit for the intellectual property of another person. Such intellectual property may be written ideas, texts, programs and the like, or even unwritten ideas. It is considered a very serious offence, especially in academia, since it compromises academic integrity. The line between plagiarism and research is becoming thinner with the information deluge of the 21st century, and the effort in academia to prevent the rampant occurrence of plagiarism is a matter of expediency. The common causes reported in the literature for its ever-increasing occurrence are: a) insufficient time to prepare assignments and homework, which can be a consequence of excessive workload in school curricula; b) lack of time to understand the subject matter on the part of the researcher, which can result from various factors; c) giving students assignments that are outside their areas of competence or for which they have not been adequately prepared; and d) laziness on the part of the researcher to actually read, comprehend and represent source ideas in a manner that does not border on plagiarism.

Suffice it to note here that the act of building on previous works is not in itself wrong; it is the lack of (or insufficient) attribution that is frowned upon. As the motto of the popular search engine Google Scholar rightly says, it is necessary to "stand on the shoulders of giants" (a paraphrase of Sir Isaac Newton's famous remark, "If I have seen further it is only by standing on the shoulders of giants") in order to advance knowledge. But the giants must be adequately referenced and their previous works attributed. There is nothing wrong with accessing information freely, but technology should also protect each document from being illegally copied. One approach that can be used to address this issue is to provide a copy detection system with which legitimate original documents are registered and copies can be detected [1].

Most plagiarism detection systems are either universal (that is, able to process text documents of any nature) or specially designed to detect plagiarism in a particular context, e.g. in source code files. The methodology adopted in the development of iCU is limited to textual search. Enterprise-level plagiarism checkers (such as Turnitin) normally compare input documents against very large proprietary databases; iCU instead uses the freely accessible Google search engine as the database for its comparison checks. The implication is that it will not return as many results as a well-known plagiarism detector such as Turnitin.

In many cases plagiarism is intentional and deliberately deceptive in nature, but it can also be the result of ignorance, which could be avoided if researchers had a better understanding of what plagiarism is. Plagiarism is to academia what piracy is to the entertainment industry. In the fight against plagiarism, it can either be proactively prevented, by educating students and researchers alike on its consequences, or detected after the fact. There are two methods of plagiarism detection: manual detection and computer-aided detection. Since detecting plagiarism manually is hard even for a skilled teacher, this work focuses on the latter, starting from an examination of already existing tools for detecting plagiarism. There are numerous plagiarism detection systems; however, not all of them implement completely new methods and algorithms.

II. PREVIOUS WORKS

A. Types of Plagiarism

There are different types of plagiarism, which include, but are not limited to, the following:

1. Unauthorized or unacknowledged collaborative work. This is when the same or similar phrases, quotations, sentences or parallel constructions appear in two or more papers on the same topic. It can be avoided by acknowledging in a footnote or endnote any significant discussions, advice, comments or suggestions the student might have received from or had with others.

2. Attempting to pass off the whole or part of a work belonging to another person or group as one's own. This includes borrowing, copying, buying, receiving, downloading, taking, using and even stealing a paper that is not one's own. Any text taken directly from another source, including course textbooks, must be placed in quotation marks, and such material must be cited.

3. Use of improperly paraphrased text without referencing. According to the Harbrace College Handbook, a paraphrase is a restatement of a source in about the same number of words. To paraphrase correctly, the wording must be distinctly different while still conveying the original author's idea [2]. Changing the word order or sentence structure, deleting words or phrases, or substituting synonyms is not enough if the original author's wording effectively remains unchanged in the text. If a passage is difficult to paraphrase, it should simply be quoted and cited. Improper paraphrasing often results from the use of a single source; in such cases it is difficult to separate one's own ideas from those of the author or translator.

4. The use of any amount of paraphrased text that is either not cited or is cited incorrectly.
The following should be documented: a) direct quotations; b) paraphrased views of others; c) uncommon knowledge; d) information that is not commonly accepted; and e) tables, charts, graphs and statistics taken directly or paraphrased from a source.

Common Knowledge

As the name implies, common knowledge is anything that is generally known to everyone. An example of common knowledge is historical information such as the Independence Day of a particular country; it could also be a well-known saying. If it is unclear whether a piece of information is common knowledge, it should rather be cited, for as the saying goes: better safe than sorry (which, incidentally, is itself a good example of common knowledge).

B. Classification of Plagiarism Detection Systems

There is no single criterion for classifying plagiarism detection systems. Most existing systems are either universal (that is, able to process text documents of any nature) or specially designed to detect plagiarism in source code files. Systems can also be classified by the underlying detection algorithm, as described below.

Fingerprint-Based Systems

The main idea of fingerprinting is to create fingerprints for all documents in a collection. A fingerprint is a short sequence of bytes that characterizes a longer file; for instance, fingerprints can be obtained by applying any hash function to a file. In plagiarism detection systems, fingerprints are more advanced than simple hash codes. Nowadays it is generally accepted that this kind of attribute counting is inferior to content comparison, since even small modifications can greatly change the fingerprints. As a result, later systems usually do not follow this technique on its own, although several recent systems combine fingerprinting with elements of string matching, for example the MOSS program.
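As a minimal illustration of this basic fingerprinting idea (a sketch for discussion only, not code taken from iCU or MOSS; the class name SimpleFingerprint and the choice of MD5 are arbitrary), the following C# fragment reduces a document to a single hash of its normalized text. Two files with the same normalized content get the same fingerprint, while a one-word edit yields a completely different one, which is exactly why plain hashing is considered inferior to content comparison.

using System;
using System.Security.Cryptography;
using System.Text;

static class SimpleFingerprint
{
    // Reduce a document to a single hash-based fingerprint of its normalized text.
    public static string Of(string text)
    {
        // Normalize whitespace and case so trivial layout changes do not matter.
        string normalized = string.Join(" ",
            text.ToLowerInvariant().Split((char[])null, StringSplitOptions.RemoveEmptyEntries));

        using (var md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(normalized));
            return BitConverter.ToString(hash).Replace("-", "");
        }
    }
}

class FingerprintDemo
{
    static void Main()
    {
        string a = "The quick brown fox jumps over the lazy dog.";
        string b = "The quick brown fox jumps over the lazy cat.";
        // Identical fingerprints indicate (near-)verbatim copies; the one-word edit
        // below produces a completely different fingerprint.
        Console.WriteLine(SimpleFingerprint.Of(a));
        Console.WriteLine(SimpleFingerprint.Of(b));
    }
}

A practical system would index such fingerprints in a database so that a newly registered document can be checked against the whole collection in a single lookup.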
Content Comparison Techniques

Content comparison techniques are the building blocks of the majority of present plagiarism detection systems. Different algorithms exist for file-to-file comparison, varying in speed, memory requirements and expected reliability.

String-Matching-Based Content Comparison

String-matching-based methods compare files by treating them as strings. They usually do not take the hierarchical structure of a document or program into account, considering it as raw data.

Parse Tree Comparison

A parse tree is an ordered, rooted tree that represents the syntactic structure of a string according to some formal grammar [3]. Parse trees may be generated for sentences in natural language as well as during the processing of computer languages, such as programming languages. Natural language texts are divided into sections, subsections, paragraphs and sentences, while source code files contain classes, functions, logic blocks and control structures. Though this approach seems to be the most advanced, little research in this area has been carried out so far; for example, it is still unknown how such a complex analysis of input files influences the final results, that is, whether parse tree comparison actually improves detection over simpler methods.

One approach to addressing plagiarism is to provide copy detection systems with which legitimate original documents are registered and against which copies are detected. Some of the already existing plagiarism detection tools are reviewed below: iCheck, JPlag, MOSS and Turnitin.

A. iCheck Plagiarism Detection Tool

Algorithm: iCheck uses an approximate phrase matching algorithm. Phrase matching requires close proximity in order to detect similarity in the headers and paragraphs of a submitted document (a header here is a line of text that indicates what the passage below it is about). The algorithm performs pattern matching to obtain proximity results: it finds all occurrences of a pattern in the sentences of the document with at most k differences, where a difference is a character of the pattern that corresponds to a different character of the text, and such errors are tolerated while scanning the document [4].

Limitation: The system works like a similarity checker, because the user has to copy and paste sentences into a text box; it does not support global comparison. It should also be noted that iCheck deals with text documents rather than source code.

B. JPlag Plagiarism Detection Tool

JPlag is a web service that finds pairs of similar programs among a given set of programs. It is written in Java and analyses the source text of programs written in Java, Scheme, C or C++.

Algorithm: JPlag uses the Greedy String Tiling algorithm and operates in two phases: a) all programs to be compared are parsed (or scanned, depending on the input language) and converted into token strings; b) the token strings are compared in pairs to determine the similarity of each pair, using Greedy String Tiling. This means that the two strings are searched for the longest contiguous matches, every token is used in at most one match, and all matches of maximal length found in a pass are marked [5]. During each comparison, JPlag attempts to cover one token string as well as possible with substrings (tiles) taken from the other; the percentage of the token strings that can be covered is the similarity value, and the corresponding tiles are visualized in HTML pages. (A parser, in this context, is a program that divides code up into its functional components.)

Limitation: JPlag can only be used for source code, not for text documents.
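To make the tiling step concrete, the following C# sketch implements a much simplified greedy string tiling over two token arrays. JPlag's real implementation adds a minimum match length, Karp-Rabin hashing and other optimisations that are omitted here, and the class and method names are illustrative only.

static class GreedyStringTiling
{
    // Returns a value between 0 (no common tiles) and 1 (token strings fully covered).
    public static double Similarity(string[] a, string[] b, int minMatch = 3)
    {
        if (a.Length + b.Length == 0) return 0.0;
        bool[] markedA = new bool[a.Length];
        bool[] markedB = new bool[b.Length];
        int covered = 0;

        while (true)
        {
            int bestLen = 0, bestI = -1, bestJ = -1;
            // Find the longest contiguous match made up entirely of unmarked tokens.
            for (int i = 0; i < a.Length; i++)
                for (int j = 0; j < b.Length; j++)
                {
                    int len = 0;
                    while (i + len < a.Length && j + len < b.Length &&
                           !markedA[i + len] && !markedB[j + len] &&
                           a[i + len] == b[j + len])
                        len++;
                    if (len > bestLen) { bestLen = len; bestI = i; bestJ = j; }
                }

            if (bestLen < minMatch) break;          // no usable tile left
            for (int t = 0; t < bestLen; t++)       // mark the tile in both strings
            {
                markedA[bestI + t] = true;
                markedB[bestJ + t] = true;
            }
            covered += 2 * bestLen;                 // tokens covered in both strings
        }

        // Fraction of all tokens in both strings that are covered by tiles.
        return (double)covered / (a.Length + b.Length);
    }
}

This naive version is quadratic in the input length and is meant only to show the marking and covering logic behind the similarity value described above.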
C. MOSS

MOSS (Measure of Software Similarity) is a widely used, free plagiarism detection service that has been available over the Internet since 1997. It is primarily used for detecting plagiarism in programming assignments in computer science and other engineering courses, though several text formats are supported as well.

Algorithm: MOSS is based on a string-matching algorithm that divides programs into k-grams, where a k-gram is a contiguous (adjacent) substring of length k. Each k-gram is hashed, and MOSS selects a subset of these hash values as the program's fingerprints. Similarity is determined by the number of fingerprints shared by the programs: the more fingerprints they share, the more similar they are [6].
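The fragment below sketches this k-gram scheme in C#. It is an illustration rather than MOSS's actual code: MOSS selects its fingerprint subset with the winnowing algorithm, whereas this sketch simply keeps the hashes divisible by a small constant p, and the names KGramFingerprints, Select and SharedFingerprints are invented for the example.

using System.Collections.Generic;
using System.Linq;

static class KGramFingerprints
{
    // Hash every k-gram of the text and keep only the hashes divisible by p as the
    // document's fingerprints (a stand-in for MOSS's winnowing step). A stable hash
    // such as MD5 would replace GetHashCode in a real system.
    public static HashSet<int> Select(string text, int k = 5, int p = 4)
    {
        var prints = new HashSet<int>();
        for (int i = 0; i + k <= text.Length; i++)
        {
            int h = text.Substring(i, k).GetHashCode();
            if (h % p == 0)
                prints.Add(h);
        }
        return prints;
    }

    // Similarity grows with the number of fingerprints the two documents share.
    public static int SharedFingerprints(HashSet<int> a, HashSet<int> b) =>
        a.Intersect(b).Count();
}

Within a single run, two documents that share many k-grams will share many selected hashes, which is the property the similarity score relies on.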
D. Turnitin

Turnitin operates in three modules: an originality check, peer review and GradeMark, and it presents its results in the form of an originality report. For the originality check, Turnitin accepts students' papers, checks them for originality and returns the result. Peer review is a feature of the Turnitin service through which the instructor can create assignments that build a collaborative learning environment in which students are encouraged to evaluate the writing decisions made by others [7]. GradeMark is a paperless grading system that, besides providing increased efficiency and flexibility, lets the instructor identify problems common to the entire class, allowing future lessons to be tailored toward eliminating such problems.

Algorithm: In the Turnitin algorithm, detection of an identical string of words does not by itself mean that the work is plagiarized; the string may have been correctly referenced or cited. On the other hand, a string of words or a section of an assignment that is identical to a document in the database and is not referenced or cited correctly may be examined further by the instructor for evidence of plagiarism.

Limitation: Turnitin can only detect the most blatant copied text. It cannot detect cleverly paraphrased passages, or copied text that has been heavily altered with a thesaurus.

Conclusion: Plagiarism checkers generally work along the same scheme: they compare the given document to the ones available online and attempt to match parts of the document against the general base of available texts. Each checker has its own text comparison algorithm, and so they detect plagiarism to different degrees. Detection may be based on a proprietary database, the open Internet, search engines, and so on. In developing software to detect plagiarism, the following factors need to be taken into consideration:

Time of analysis: the speed with which a system performs the required check and presents the results. This factor must be considered together with the others, since a quick but inefficient check is useless.

Document capacity: how many documents or words the plagiarism detection system can process per unit of time.

Detection intensity: how often and what volume of text is checked against search engines within the detection time scope (for example, every ten words, or only whole paragraphs).

Comparison algorithm: the methodology employed for comparison and for presenting results in the plagiarism report.

Precision: the main characteristic assessing the quality of the check; the more documents that are correctly flagged as plagiarized, the better the system.

III. ICU DESIGN METHODOLOGY

A. Design Approach

The waterfall model was used as the process model for iCU, while the bottom-up approach was the chosen design approach. Bottom-up design refers to a style in which an application is constructed starting from its existing individual components, gradually building more and more complicated features until the whole application has been written [3].

B. Use Case Diagram

Figure 1: The use case diagram of the software.

C. Use Case Narratives

Figure 1 shows the use case diagram of iCU. The software was implemented in three sub-modules, namely: i) global comparison, ii) similarity detection and iii) file comparison. The use case narratives for the three sub-modules are presented in Table 1 of the appendix.

D. Coding Implementation

There are three sub-modules in the software: 1. the file comparison module; 2. the similarity detection module; 3. the global comparison module. The Integrated Development Environment (IDE) used was Visual Studio, and the same programming languages were used across the three sub-modules: the main language was C#, with ASP.NET and HTML used alongside it.

IV. ICU SOFTWARE INTERFACE

A. File Comparison Interface

Figure 5: iCU file comparison interface.

Figure 5 shows the file comparison interface. Here, the user selects two files stored on the system and clicks the button 'Check Duplicate Content'. The software then compares the two files line by line and reports the result in three statements: the number of lines that are similar in both files, the number of lines that are not the same in both files, and the number of lines that are in the first file selected but not in the second.
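A minimal C# sketch of this behaviour (an illustration of the three counts reported by the interface, not iCU's actual source; FileComparer and CheckDuplicateContent are names invented for the example) could look as follows.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class FileComparer
{
    // Reports the three counts described above for two text files:
    // lines present in both, lines not shared, and lines only in the first file.
    public static void CheckDuplicateContent(string firstPath, string secondPath)
    {
        var first = new HashSet<string>(ReadLines(firstPath));
        var second = new HashSet<string>(ReadLines(secondPath));

        int similar = first.Count(line => second.Contains(line));
        int onlyInFirst = first.Count - similar;
        int notSame = onlyInFirst + (second.Count - similar);

        Console.WriteLine($"Lines similar in both files: {similar}");
        Console.WriteLine($"Lines not the same in both files: {notSame}");
        Console.WriteLine($"Lines in the first file but not in the second: {onlyInFirst}");
    }

    // Trimmed, non-empty lines of the file.
    private static IEnumerable<string> ReadLines(string path) =>
        File.ReadAllLines(path).Select(l => l.Trim()).Where(l => l.Length > 0);
}

The real module may compare lines positionally rather than as sets of distinct lines; the sketch only mirrors the three figures the interface reports.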
B. Similarity Detection Interface

Figure 6: iCU similarity detection interface.

The screenshot of this interface is shown in Figure 6. In this sub-module, the user opens a document, selects some text from it and pastes it into the text box shown. Once the system is connected to the Internet and the user clicks the button 'Check Duplicate Content', the software queries the integrated search database and generates the results. The check box labelled 'Add quotes' makes the software search the database for exactly the same text as the one pasted in the text box.

C. Global Comparison Interface

Figure 7: iCU global comparison interface.

The screenshot of this interface is shown in Figure 7. The user selects a file and clicks the button 'Compare Globally'. The result lists the related websites and the percentage of plagiarism.

D. Software Testing

The developed software was tested using standard software testing procedures. Testing validates that a software application meets the business and technical requirements that guided its design and development, and works as expected [8]; it focuses on finding defects in the final product. The following tests were carried out on iCU: 1) unit test, 2) system test, 3) integration test, 4) user acceptance test and 5) regression test.

Unit test: checks whether individual units of code are fit for use. Each unit of code was tested individually for functionality before integration; this helps to detect coding problems early in the development cycle.

System test: validates and verifies the software against the functional design specification.

Integration test: ensures that the integrated sub-modules work together properly.

User acceptance test: a test run for a customer to demonstrate functionality. This test was carried out by a user, who was to a large extent satisfied with the software.

Regression test: rerunning test cases from existing test suites to build confidence that software changes have no unintended side effects. After changes were made to the finished software, this test was carried out and the software passed.

V. CONCLUSIONS

This paper has presented the development of iCU, a software system for the detection of plagiarism in written text files. The software was implemented in C# using a modular approach and comprises three sub-modules, each performing a different plagiarism check function. The developed iCU software passed all of the tests mentioned above.

REFERENCES

[1] R. Olt, "A new design on plagiarism: developing an instructional design model to deter plagiarism in online courses," unpublished.
[2] Harbrace College Handbook, available at http://bit.ly/jiJcsS, accessed 26 June 2011, p. 255.
[3] J. Rumbaugh et al., Object-Oriented Modeling and Design, Prentice Hall, 1991.
[4] R. Baeza-Yates and G. H. Gonnet, "A new approach to text searching," Communications of the ACM, vol. 35, pp. 74-82.
[5] Berghel and Sallach, "Measurements of program similarity in identical task environments," pp. 65-76, 1984.
[6] A. Si, H. V. Leong and R. W. H. Lau, "A document plagiarism detection system," Proceedings of the 1997 ACM Symposium on Applied Computing, pp. 70-77, 1997.
[7] G. Brumel, "Physicist found guilty of misconduct," pp. 419-421, Sept. 2002.
[8] T. Reenskaug, Working with Objects: The OOram Software Engineering Method, Manning, 1995.

APPENDIX

TABLE 1. USE CASE NARRATIVES FOR THE ICU SUB-MODULES

Global Comparison
  Brief description: This use case describes how the user operates the global comparison part of the project.
  Actor: Lecturer.
  Basic flow: This use case begins when the actor decides to use this section of the project. 1. The actor clicks on the global comparer button. 2. The page appears. 3. The actor is required to choose a file. 4. Once the actor selects a file and clicks the button 'compare files', the system connects to an online server and compares the file with other files in the database of the server it has connected to.
  Special requirements: Internet connection and web browser.
  Pre-conditions: None.
  Post-condition: The result is a collection of websites and documents that are very similar to the file the actor has uploaded.
  Extension points: None.

Similarity Detection
  Brief description: This use case describes how the actor can use the project as a search engine.
  Actor: Lecturer.
  Basic flow: 1. The actor clicks the button and the page appears. 2. The actor opens a file and selects some part of it. 3. The actor copies the text into the text box shown and clicks 'compare files'. 4. The system, just as in the case of the global comparer, connects to an online server and searches for documents that are exactly the same as, or closely related to, the selected text.
  Special requirements: Internet connection and web browser.
  Pre-conditions: None.
  Post-condition: After the actor submits the selected text, the result lists websites that have exactly the same or closely related text to the one selected from the file.
  Extension points: None.

File Comparison
  Brief description: The actor uses this section to ascertain plagiarism in the case of an already existing contention.
  Actor: Lecturer.
  Basic flow: 1. The actor clicks the button labelled 'compare files online'. 2. The actor selects the two files to be compared. 3. The actor then clicks the button 'compare files'.
  Special requirements: None.
  Pre-conditions: There should be already-saved documents on the system.
  Post-condition: The result shows the percentage of similarity between the two files.
  Extension points: None.