Automatic Plagiarism Detection
Charlie Daly, Jane Horgan
Dublin City University

Overview
• Context
• How it works (overview)
• Comparison with other plagiarism detection systems
• How it works (details)
  – Marks the original (with a watermark)
  – Invisibly
• Results

Context
• It is not
  – catching people breaking copyright
  – detecting plagiarism in essays etc.
• It only works for programs, specifically when students submit a program for an assignment.
• Plagiarism is a huge problem on many programming courses.

Why are there so many systems?
• Lecturers who are also programmers get upset when they see their students copying their assignments.
  – It is seen as an affront
  – So they write a program.
• 'Efficiency' in education => large class sizes => manual detection is difficult.

So why another system?
• All previous systems use pair-wise comparison: each program is compared against the other programs.
• This means
  – they are programming-language specific
  – they don't work across years
  – they cannot identify the original author.

So how does our technique work?
• When a student submits a program, the program is marked with a watermark indicating the author.
• If the student subsequently gives an electronic copy of the program to another student, the system will recognise the watermark as soon as the copy is submitted.

But ...
• Need to be able to modify the original student's file.
• The watermark needs to be invisible to the student.

The process
[Diagram: Program, Watermark, Hard Disk]
• Here's the student program, stored on a hard disk.
• The student submits the program.
• The watermark is added, on the student's own hard disk!

Compared to previous systems
+ Can detect plagiarism as soon as submitted
+ Identifies the author
+ Programming-language independent
+ Works with tiny programs
- Only works with an electronic copy
- Easy to bypass if students know about it
- Plagiarising student must get a copy after it has been submitted by the author

RoboProf provides infrastructure
• RoboProf is a learning environment.
• Automatically sets and marks simple assignments.
• The student submits a program, which is compiled and run on the student's machine.
  – an applet with read-write access is used to manage the compilation and marking.
• The program output is then sent to the server for marking.

RoboProf
[Diagram: Browser, Server, Assignment Specification]
• The student logs on.
• The student writes a program and submits it.
• An applet compiles and runs the program locally.
• The program and output are sent to the server for marking.
• Results are returned to the student.

Part 1: modifying the student program
• Now that an applet can write to the student's disk, it can modify the student's file (to add the watermark).
• Only problem remaining ... how do we implement the watermark?

The Watermark
• Needs to be invisible to the student.
• Needs to encode
  – the student ID
  – the year

The Watermark
• Use 10 binary digits for the student ID => can distinguish 1024 students.
• Use 4 binary digits for the year.
• Also use an ID for the assignment and record which attempt it is (RoboProf allows students to resubmit a program to improve the mark).
• Checksum (4 digits)

The watermark
• The binary code requires 34 bits (10+4+10+6+4).
• This code is written directly onto the file.
  Student ID   Year
  0000101010   0001   000010110   000111   0000
  #include <stdio.h>
  main() { }

Making it invisible
A space is used to represent the binary digit 0 and a tab is used to represent the binary digit 1.
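
As a minimal sketch of the encoding described on these slides (not RoboProf's actual code), the following Python packs the fields into 34 bits and maps 0 to a space and 1 to a tab. The function names, the field order (student, year, assignment, attempt, checksum) and especially the checksum rule (number of 1 bits modulo 16) are assumptions made for illustration; the slides give only the field widths.

```python
# Field widths taken from the slides (10+4+10+6, plus a 4-bit checksum).
# ASSUMPTION: the 10-bit and 6-bit fields are the assignment ID and the attempt.
WIDTHS = {"student": 10, "year": 4, "assignment": 10, "attempt": 6}
CHECKSUM_BITS = 4


def checksum(bits: str) -> str:
    """ASSUMPTION: count of 1 bits modulo 16; the slides do not define the checksum."""
    return format(bits.count("1") % 16, "04b")


def encode(student: int, year: int, assignment: int, attempt: int) -> str:
    """Return a 34-character watermark made only of spaces (0) and tabs (1)."""
    values = {"student": student, "year": year,
              "assignment": assignment, "attempt": attempt}
    bits = ""
    for name, width in WIDTHS.items():
        if not 0 <= values[name] < 2 ** width:
            raise ValueError(f"{name} does not fit in {width} bits")
        bits += format(values[name], f"0{width}b")
    bits += checksum(bits)
    return bits.replace("0", " ").replace("1", "\t")


def decode(mark: str):
    """Return the fields as a dict, or None if the text is not a valid watermark."""
    total = sum(WIDTHS.values()) + CHECKSUM_BITS
    if len(mark) != total or set(mark) - {" ", "\t"}:
        return None
    bits = mark.replace(" ", "0").replace("\t", "1")
    payload, check = bits[:-CHECKSUM_BITS], bits[-CHECKSUM_BITS:]
    if checksum(payload) != check:
        return None
    fields, pos = {}, 0
    for name, width in WIDTHS.items():
        fields[name] = int(payload[pos:pos + width], 2)
        pos += width
    return fields
```

Under these assumptions, encode(42, 1, 22, 7) starts with the ten characters for 0000101010, the student ID shown on the slide, and decode returns None for any text that fails the checksum.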
0000101010 becomes space space space space tab space tab space tab space => invisible!

Results
• We used the plagiarism detector as part of RoboProf on a group of 283 students. There were two main parts to the course: continuous assessment and a programming exam.
• The continuous assessment was to be done in the students' own time and was subject to plagiarism, whereas the programming exam was supervised.

Results
• We compared the exam results of those who plagiarised (40%) with those who didn't.
• The results are unsurprising: plagiarists performed less well in the exam, and the more they plagiarised, the worse they performed.
• Also, plagiarists submitted their continuous assessment on average a week later than their honest peers.

[Figure: Incidence of plagiarism. Axes: Number copied, Frequency]
[Figure: Exam Results. Axes: Number copied, Exam mark]
[Figure: Completion date. Series: copied, original]

The end

Questions
• What happens if a program is submitted which already contains a watermark?
• It can happen legitimately if a student resubmits a program.
• So the watermark is checked against the submitter's ID, and if they don't match, the lecturer is emailed and investigates further.
• Then the watermark is overwritten => can detect chains of plagiarism.

Another Question
• Why did you only monitor plagiarism; why not take any action?
• There are three answers:
  – Resources: The university has machinery in place to deal with plagiarism. It is very bureaucratic and soaks up time.
  – Some students accidentally committed plagiarism (testing the system).
  – Need corroborating evidence; can't let the trick be known.

Question 3
• "But won't the watermark that is sent to the server have been just created by the system? It'll just read the watermark it generated."
• No. It reads the program and then doctors it.
  The server gets the unadulterated program; the student is left with the modified program.

Question 4
• Any problems in practice?
• Yes, a modern IDE can detect when the source has been modified and asks if you wish to reload the buffer. Hasn't been fixed yet.
• You need to correctly set up applet security.
• A student may save the file after it has been modified (clean version still in the editor).

How much is original
• Inserting a watermark unknown to the user (as far as I know).
• Unseen whitespace has been used to detect copyright infringement (it was unknowingly inserted by the author).
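
The check described under Questions (compare any existing watermark with the submitter's ID, then overwrite it) can be sketched roughly as follows. This is an illustration only: the function names are hypothetical, the watermark is reduced to the 10-bit student ID, and placing it on a first line made only of spaces and tabs is an assumption, since the slides do not say where in the file it goes. Emailing the lecturer is left to the caller.

```python
ID_BITS = 10  # from the slides: 10 binary digits for the student ID


def id_to_mark(student: int) -> str:
    """Encode a student ID as spaces (0) and tabs (1)."""
    return format(student, f"0{ID_BITS}b").replace("0", " ").replace("1", "\t")


def read_mark(source: str):
    """Return (student ID or None, program text without the watermark line).

    ASSUMPTION: the watermark is a first line made only of spaces and tabs.
    """
    first, _, rest = source.partition("\n")
    if len(first) != ID_BITS or set(first) - {" ", "\t"}:
        return None, source
    return int(first.replace(" ", "0").replace("\t", "1"), 2), rest


def submit(source: str, submitter: int):
    """Return (suspected original author or None, text left on the student's disk)."""
    found, body = read_mark(source)
    # A matching ID is a legitimate resubmission; a different ID is suspicious.
    suspect = found if found is not None and found != submitter else None
    # Overwrite the old watermark so the next copy points at this submitter.
    return suspect, id_to_mark(submitter) + "\n" + body
```

For example, if student 42 submits a clean file and student 99 later submits the watermarked copy, submit reports 42; a further copy submitted by a third student would then report 99, which is how a chain of plagiarism shows up.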