XV International PhD Workshop OWD 2013, 19–22 October 2013

System of the automatic recognition of reproduced fragments of the text documents as a tool of improvement the quality control of educational process

Yury Krapivin, Brest State Technical University

Abstract

The article presents a solution to the plagiarism identification problem based on a system for the automatic recognition of reproduced fragments of text documents. The system was tested on real data: students' graduate works.

1. Introduction.

Recent decades have been characterized by the rapid development of information technologies all over the world, which is reflected in the growing use of electronic forms of information storage, accumulation and processing in all areas of human activity. The quantity of information available on the Internet increases every year, and the Internet has become a means of both distributing and storing information. In addition, the rapidly growing number of Internet resources with varied content and simple, intuitive interfaces that provide access to full-text databases significantly simplifies the work of even an inexperienced user and makes it possible to satisfy almost any information need.

2. Problem definition.

The constantly increasing amount of information available on the Internet, most of it represented as text documents in various formats (DOC, PDF, HTML and so on), creates many problems alongside its obvious advantages. One of them is automatic plagiarism identification. Plagiarism usually refers to the intentional appropriation of the authorship of someone else's work of literature, science or art, invention or innovation proposal (in full or in part). Cases of plagiarism can also be unpremeditated, for example due to strong external informational influence, which can manifest itself in the use of distinctive ideas or a specific way of expressing them, or due to failure to comply with generally accepted citation rules when the information is presented in text form.
It is reasonable to approach this problem in two steps: first, the identification of the equivalent fragments (exact match, or match within lexical and grammatical synonymy) of the given text document and of text documents from a given database or available from Internet resources; second, the analysis of the equivalent fragments in terms of adoption, with the involvement of experts, i.e. with regard to plagiarism identification [1]. Thus we are concerned with the recognition of reproduced fragments of text documents, i.e. those fragments of the given (input) document that were adopted from other documents which, in the final analysis, can be found in a given multilingual full-text database (a Russian-Belorussian one in our case) [2].

The basic functionality of the system for the automatic recognition of reproduced fragments of text documents was described in [2]. The system consists of the following main subsystems: language identification of the text document; machine translation; automatic indexing and retrieval of relevant documents; and identification of the equivalence of document fragments. This work presents the results of testing this system on students' graduate works, with the purpose of automatically recognizing reproduced fragments of text documents.

3. Testing of the system.

Writing a graduate work is an important step in training specialists: it tests the general readiness of university graduates for professional activity, consolidates the knowledge and practical skills obtained by students, and reveals their capacity for creative thinking and their ability to solve professional and scientific tasks independently, in accordance with the selected speciality.
The graduate work is based on the data from pre-graduation practice and on the term and research papers completed by the student independently during their studies at the university.

The explanatory note of the graduate work is a text document whose structure and volume vary slightly depending on the speciality. It reflects the sequence of operations, explains the graphic material and describes the results achieved in the course of the graduate work. It usually consists of several sections, each of which logically unites thematic information. As a rule, those sections are devoted to:

1. research and systematic analysis of the subject area or the automation object, specifying the requirements or tasks to be resolved;
2. evaluation of the existing techniques for solving the assigned tasks and a justified choice of the appropriate concept;
3. the design itself, according to the chosen concept and the existing guidelines and standards regulating this process, with a description of the competitive advantages that the obtained results might possess.

We ran the system on the explanatory notes of students' graduate works from both humanities and technical specializations, 94 text documents in all. The testing was carried out according to the following scheme: a text document entered the system, and key words were then extracted from it automatically. They formed the basis of a search query to the Google search engine [3], which in turn solved the problem of retrieving the documents relevant to the input one, so that adoptions could subsequently be recognized automatically against the obtained set of relevant documents. Only the first 20 retrieved documents were considered as potential sources of adoptions for the analyzed document.
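The testing scheme above (key word extraction, query formation, keeping the first 20 results) can be illustrated with a minimal Python sketch. It is not the author's implementation. It assumes TF-IDF weighting for key word selection (the metric the paper names later, [4]), a simple regular-expression tokenizer in place of real linguistic preprocessing, and a hypothetical `search` callable standing in for the search engine. All function names are illustrative, and the interactive key word selection by the user is omitted.

```python
import math
import re
from collections import Counter
from typing import Callable, Iterable, List, Sequence

MAX_SOURCES = 20  # the paper considers only the first 20 retrieved documents


def tokenize(text: str) -> List[str]:
    """Lowercase word tokens; a stand-in for real linguistic preprocessing."""
    return re.findall(r"\w+", text.lower())


def tfidf_keywords(document: str, background: Sequence[str], k: int = 5) -> List[str]:
    """Return the k terms of `document` with the highest TF-IDF weight.

    `background` is an assumed reference collection used only to estimate
    document frequencies; the paper does not specify one.
    """
    doc_tokens = tokenize(document)
    # The document itself is part of the corpus, so every term has df >= 1.
    corpus = [set(doc_tokens)] + [set(tokenize(d)) for d in background]
    weights = {}
    for term, count in Counter(doc_tokens).items():
        df = sum(1 for terms in corpus if term in terms)
        tf = count / len(doc_tokens)
        weights[term] = tf * math.log(len(corpus) / df)
    # Highest weight first; ties are broken alphabetically for determinism.
    return sorted(weights, key=lambda t: (-weights[t], t))[:k]


def candidate_sources(
    keywords: Sequence[str],
    search: Callable[[str], Iterable[str]],
    limit: int = MAX_SOURCES,
) -> List[str]:
    """Form a query from the key words and keep the first `limit` results.

    `search` is a hypothetical callable (query -> iterable of URLs); no real
    search-engine API is assumed here.
    """
    query = " ".join(keywords)
    return list(search(query))[:limit]


if __name__ == "__main__":
    background = ["the system stores data", "the data is processed by the user"]
    doc = "the neural network classifier: the network is trained on the data"
    keywords = tfidf_keywords(doc, background, k=2)

    def fake_search(query: str) -> List[str]:
        # Placeholder result list used only for this demonstration.
        return [f"http://example.org/{i}" for i in range(50)]

    print(keywords)                                       # ['network', 'classifier']
    print(len(candidate_sources(keywords, fake_search)))  # 20
```

Terms that occur in every document of the reference collection (such as "the" in the demonstration) receive zero weight, which is why they never reach the query.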
The search query was formed in the interactive mode of the program, in which the user could choose, from the automatically extracted key words, those that in his or her opinion expressed the content of the analyzed document most comprehensively. This approach relieved the user of the need to analyze the input document in detail, which usually takes a considerable amount of time. Since the user was usually an expert in the subject area of the analyzed document (for example, a scientific adviser), he or she was able to express the information need as accurately and quickly as possible in terms of the key words suggested by the system.

Testing of the first documents led to the assumption that the authors adhered to a certain strategy while writing their works: they used thematic groups of Internet resources to obtain text fragments for use in particular sections of the explanatory note. The works differed in topic and in the structure of the explanatory notes. The thematic groups of Internet resources comprised sites giving access to collections of essays, graduate works and theses, as well as sites of news agencies, enterprise portals, e-business portals and online communities. As a rule, authors used collections of essays, graduate works and theses to present theoretical considerations and problem-solving techniques in sections 2 and 3 of the generalized structure of the explanatory note described above, whereas documents describing the state of the subject area or the automation object (for example, technological or economic indicators such as profit and sales margins, costs and typical development markers), hosted on the sites of news agencies and on enterprise and e-business portals, were the source of information for sections 1 and 3.
The analyzed documents contain adoptions both at the level of ideas and at the level of the ways they are expressed; the former probably arises from the use of standard problem-solving techniques in graduate works. Taking these facts into consideration, we decided to change the approach to the automatic processing of the explanatory note and to process it by sections: to extract the key words from every section and to search for the relevant documents on the appropriate thematic Internet resources. For this purpose we used collections of essays and theses, including http://bibliofond.ru and http://bestreferat.ru. This improved the quality of the identification of adopted fragments by increasing recall, which depends both on the automatic selection of key words that reflect the content of the document more accurately (consistent with the metric used for the selection, the TF-IDF weighting method [4]) and on the identification of the potential sources of relevant documents.

As already mentioned, 94 documents were tested by the system; the average size of a document was 91974 characters. The bibliography contained 14.12 items per document on average, including 7.0 printed editions (books and manuals) and 3.18 Internet resources. The total size of the equivalent fragments identified by the system amounted to 532003 characters, or 5660 characters per document (6.15% of the total size of the analyzed documents). The number of retrieved resources that contained equivalent fragments was 205, or 2.18 per document. The equivalent fragments occurred most frequently in the collections of essays and theses (in 44.68% of works), including http://bibliofond.ru (26% of cases), http://knowledge.allbest.ru (11.7%) and http://bestreferat.ru (6.38%).
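The per-document averages reported above follow directly from the stated totals. The short Python check below recomputes them; the variable and function names are illustrative, and only figures quoted in the text are used as inputs.

```python
DOCUMENTS = 94                   # number of tested documents
AVG_DOCUMENT_SIZE = 91974        # average document size, characters
TOTAL_EQUIVALENT_SIZE = 532003   # total size of equivalent fragments, characters
RESOURCES_WITH_MATCHES = 205     # retrieved resources containing equivalent fragments


def per_document(total: float, documents: int = DOCUMENTS) -> float:
    """Average of a total over the tested documents."""
    return total / documents


def share_percent(part: float, whole: float) -> float:
    """Share of `part` in `whole`, in percent."""
    return 100.0 * part / whole


avg_equivalent = per_document(TOTAL_EQUIVALENT_SIZE)
share = share_percent(TOTAL_EQUIVALENT_SIZE, DOCUMENTS * AVG_DOCUMENT_SIZE)
avg_resources = per_document(RESOURCES_WITH_MATCHES)

# prints: 5660 6.15 2.18
print(round(avg_equivalent), round(share, 2), round(avg_resources, 2))
```

The recomputed values agree with the reported 5660 characters per document, 6.15% of the analyzed text and 2.18 resources per document.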
At the same time, 6.38% of works cited the collections of essays and theses, and the most frequently cited resources (23.4% of works) were http://ru.wikipedia.org and http://habrahabr.ru. At least one equivalent fragment was identified in 84% (79 of 94) of works.

The size of the equivalent fragments exceeded the average (5660 characters) in 36.17% of cases (34 works); these works contained 104 references in the bibliography, i.e. 34.78% of the total number of references to Internet resources. Among those 34 analyzed documents, 35.29% (12 works) contained references to Internet resources, and the most frequently cited resources were http://ru.wikipedia.org (17.64%) and http://knowledge.allbest.ru and http://habrahabr.ru (8.82% each). In this group the equivalent fragments occurred most frequently (in 88.23% of works) in the collections of essays and theses, including http://bibliofond.ru (47.05% of cases), http://knowledge.allbest.ru (29.41%) and http://bestreferat.ru (11.76%).

4. Conclusion.

The results presented above suggest that the developed system can be applied to improve the quality control of the educational process at the stage of university students' graduate works.

5. Bibliography And Authors

[1] Krapivin Y.: Automatic identification of the semantically equivalent fragments of the text documents, Proceedings of OWD'2012, 20–23 October 2012, Wisla
[2] Krapivin Y.: Plagiarism identification in multilingual information environment, Proceedings of OWD'2011, 22–25 October 2011, Wisla
[3] http://www.google.com/
[4] Robertson S.: Understanding Inverse Document Frequency: On Theoretical Arguments for IDF, Journal of Documentation, 2004, 60 (5), pp. 503–520

Authors:
Mr. Yury Krapivin
Brest State Technical University
Moskovskaja str., 267
224017 Brest, Belarus
tel. +375 297 98 81 46
fax +375 162 42 21 27
email: ybox@list.ru