XV International PhD Workshop
OWD 2013, 19–22 October 2013
System of the automatic recognition of reproduced fragments
of the text documents as a tool of improvement the quality
control of educational process
Yury Krapivin, Brest State Technical University
Abstract
The article presents a solution to the problem of plagiarism
identification using a system for the automatic recognition of
reproduced fragments of text documents. The system was tested
on real data: students’ graduate works.
1. Introduction.
Recent decades have been characterized by the rapid
development of information technologies all over the world,
which is reflected in the spread of electronic forms of
information storage, accumulation and processing in all areas
of human activity. The quantity of information available on
the Internet increases every year. The Internet has become a
means of both distributing and storing information. Moreover,
the rapidly growing number of Internet resources with varied
content and simple, intuitive interfaces that provide access
to full-text databases significantly simplifies the work of
even an inexperienced user and makes it possible to satisfy
almost any information need.
2. Problem definition.
The constantly increasing amount of information available on
the Internet, most of which is represented as text documents
of various formats (DOC, PDF, HTML and so on), creates a
number of problems alongside its obvious advantages. One of
them is the automatic identification of plagiarism, which
usually refers to the intentional appropriation of authorship
of someone else’s work of literature, science or art,
invention or innovation proposal (fully or partly). Plagiarism
can also be unpremeditated, for example as a result of strong
external informational influence, which may manifest itself in
the use of distinctive ideas or of a specific way of
expressing them, or, for information presented in text form,
as a failure to comply with the generally accepted rules of
citation.
It is reasonable to solve this problem in the following two
steps:
1. identification of the equivalent fragments (exact match or
match within lexical and grammatical synonymy) of the given
text document and of text documents from the given database or
available from Internet resources;
2. analysis of the equivalent fragments in terms of borrowing,
with the involvement of experts, i.e. the identification of
plagiarism proper [1].
Thus we are concerned with the recognition of reproduced
fragments of text documents, i.e. those fragments of the given
(input) document that were borrowed from other documents which
can ultimately be found in the given multilingual full-text
database (a Russian-Belorussian one in our case) [2].
The basic functionality of the system for the automatic
recognition of reproduced fragments of text documents was
described in [2]. The system consists of the following main
subsystems: language identification of the text document,
machine translation, automatic indexing and retrieval of
relevant documents, and identification of the equivalence of
document fragments. This work presents the results of testing
the system on students’ graduate works, with the purpose of
automatically recognizing reproduced fragments of text
documents.
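The exact-match case of fragment equivalence can be illustrated with a minimal sketch. It is not the subsystem described in [2]: the word-shingle technique, the function names and the shingle length n are assumptions made for illustration, and matching within lexical and grammatical synonymy is not covered.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


def shingle_set(tokens, n):
    """Collect every run of n consecutive tokens as a tuple."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def equivalent_fragments(input_text, source_text, n=5):
    """Return runs of input tokens covered by shingles shared with the source.

    Assumption: exact word-level matching only; n is a hypothetical
    minimum fragment length in words.
    """
    in_tokens = tokenize(input_text)
    source_shingles = shingle_set(tokenize(source_text), n)

    # Mark every input token that belongs to at least one shared shingle.
    covered = [False] * len(in_tokens)
    for i in range(len(in_tokens) - n + 1):
        if tuple(in_tokens[i:i + n]) in source_shingles:
            for j in range(i, i + n):
                covered[j] = True

    # Merge consecutive covered tokens into fragments.
    fragments, start = [], None
    for i, flag in enumerate(covered + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            fragments.append(" ".join(in_tokens[start:i]))
            start = None
    return fragments
```

A larger n reduces accidental matches on common phrases at the cost of missing short reproduced fragments.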
3. Testing of the system.
Writing a graduate work is an important step in the training
of specialists. It tests the general readiness of university
graduates for professional activity, consolidates the
knowledge and practical skills obtained by the students, and
reveals their capacity for creative thinking and their ability
to solve professional and scientific tasks independently, in
accordance with the selected speciality.
The graduate work is based on the data from the pre-graduation
practice and on the term and research papers completed by the
student independently during their studies at the university.
The explanatory note of the graduate work is a text document
whose structure and volume vary slightly depending on the
speciality. It reflects the sequence of the work performed,
explains the graphic material and describes the results
achieved while writing the graduate work. It usually consists
of several sections, each of which groups thematically related
information. As a rule, the sections are devoted to:
1. research and systematic analysis of the subject area or the
automation object, specifying the requirements or the tasks to
be solved;
2. evaluation of the existing techniques for solving the
assigned tasks and a justified choice of the appropriate
concept;
3. the design itself, according to the chosen concept and the
existing guidelines and standards regulating this process,
with a description of the competitive advantages that the
obtained results might possess.
To test the system we used the explanatory notes of students’
graduate works in both humanities and technical
specializations, 94 text documents in all.
The testing was carried out according to the following scheme.
A text document was fed into the system, and its key words
were extracted automatically. They formed the basis of a
search query to the Google search engine [3], which in turn
solved the problem of retrieving documents relevant to the
input one, so that borrowings from the obtained set of
relevant documents could then be recognized automatically.
Only the first 20 retrieved documents were considered as
potential sources of borrowing for the analyzed document. The
search query was formed in the interactive mode of the
program: the user could choose, from the automatically
extracted key words, those which in his opinion expressed the
content of the analyzed document most comprehensively. This
approach relieved the user of the need to analyze the input
document in detail, which usually takes a considerable amount
of time. Since the user was usually an expert in the subject
area of the analyzed document (for example, a scientific
adviser), he was able to express the information need as
accurately and quickly as possible in terms of the key words
suggested by the system.
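The retrieval scheme described above can be sketched as follows. This is an illustration under stated assumptions, not the tested program: build_query and retrieve_candidates are hypothetical names, and the search backend is passed in as a plain callable because the way the system addressed the search engine is not described here. Only the limit of 20 documents comes from the text.

```python
from typing import Callable, Iterable, List

MAX_CANDIDATES = 20  # only the first 20 retrieved documents are considered


def build_query(suggested: Iterable[str], chosen: Iterable[str]) -> str:
    """Join the key words the user chose, keeping the order suggested by the system.

    Key words that the system did not suggest are ignored.
    """
    chosen_set = {word.lower() for word in chosen}
    return " ".join(word for word in suggested if word.lower() in chosen_set)


def retrieve_candidates(
    query: str,
    search: Callable[[str], List[str]],
    limit: int = MAX_CANDIDATES,
) -> List[str]:
    """Run the query through a caller-supplied backend and keep the first
    `limit` distinct results as potential sources.

    Assumption: `search` returns document addresses ordered by relevance.
    """
    seen, candidates = set(), []
    for address in search(query):
        if address not in seen:
            seen.add(address)
            candidates.append(address)
        if len(candidates) == limit:
            break
    return candidates
```

Passing the backend as a parameter keeps the sketch independent of any particular search service and makes it easy to exercise with a stub.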
Testing of the first documents led to the assumption that the
authors adhered to a certain strategy while writing their
works: they used thematic groups of Internet resources to
obtain text fragments for particular sections of the
explanatory note. The works differed in topic and in the
structure of their explanatory notes. The thematic groups
comprised sites providing access to collections of essays,
graduate works and theses, as well as sites of news agencies,
enterprise portals, e-business portals and online communities.
As a rule, authors used collections of essays, graduate works
and theses to present theoretical considerations and
problem-solving techniques in sections 2 and 3 of the
generalized structure of the explanatory note given above,
whereas documents describing the state of the subject area or
the automation object (for example, technological or economic
indicators: profit and sales margins, costs and typical
development markers), located on the sites of news agencies,
enterprise portals and e-business portals, were the source of
information for sections 1 and 3. The analyzed documents
contain borrowings both at the level of ideas and at the level
of their expression; the former probably result from the use
of standard problem-solving techniques in graduate works.
In view of the above, it was decided to change the approach to
the automatic processing of the explanatory note and to
process it section by section: to extract the key words from
every section and to search for relevant documents on the
appropriate thematic Internet resources. For this purpose we
used collections of essays and theses, including
http://bibliofond.ru and http://bestreferat.ru. This improved
the quality of identification of borrowed fragments by
increasing recall, which relies both on the automatic
selection of key words that reflect the content of the
document more accurately (consistent with the metric used for
the selection, the TF-IDF weighting method [4]) and on the
identification of the potential sources of relevant documents.
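Section-wise key word selection with TF-IDF weighting can be sketched as below. The text names only the TF-IDF method [4]; the particular variant (relative term frequency multiplied by log(N / df)), the use of the sections themselves as the reference collection, and the function names are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def top_keywords(sections, k=3):
    """For each section, return the k terms with the highest tf * log(N / df).

    tf is the relative frequency of the term in the section, N the number
    of sections, df the number of sections that contain the term.
    """
    counts_per_section = [Counter(tokenize(section)) for section in sections]
    n_sections = len(counts_per_section)

    # Document frequency: in how many sections each term occurs.
    df = Counter()
    for counts in counts_per_section:
        df.update(counts.keys())

    result = []
    for counts in counts_per_section:
        total = sum(counts.values())
        weights = {
            term: (count / total) * math.log(n_sections / df[term])
            for term, count in counts.items()
        }
        # Highest weight first; ties are broken alphabetically.
        ranked = sorted(weights, key=lambda term: (-weights[term], term))
        result.append(ranked[:k])
    return result
```

With this weighting a term that occurs in every section receives zero weight, so the key words selected for a section are those that distinguish it from the others.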
As already mentioned, 94 documents were processed by the
system; their average size was 91974 characters. The
bibliography contained 14.12 items per document on average,
including 7.0 printed editions (books and manuals) and 3.18
Internet resources. The total size of the equivalent fragments
identified by the system amounted to 532003 characters, or
5660 characters per document (6.15% of the total size of the
analyzed documents).
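The per-document average and the share of equivalent fragments follow directly from the reported totals. The short check below uses only the figures given in the text; the variable names are illustrative.

```python
DOCS = 94                   # number of analyzed documents
AVG_DOC_CHARS = 91974       # average document size, characters
TOTAL_EQUIV_CHARS = 532003  # total size of equivalent fragments, characters

# Average size of equivalent fragments per document.
avg_equiv_chars = TOTAL_EQUIV_CHARS / DOCS

# Share of equivalent fragments in the total size of all documents.
share = TOTAL_EQUIV_CHARS / (DOCS * AVG_DOC_CHARS)

print(round(avg_equiv_chars))  # 5660
print(f"{share:.2%}")          # 6.15%
```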
Equivalent fragments were found in 205 retrieved resources, or
2.18 per document. They occurred most frequently in the
collections of essays and theses (in 44.68% of works),
including http://bibliofond.ru (26% of cases),
http://knowledge.allbest.ru (11.7%) and http://bestreferat.ru
(6.38%). At the same time, 6.38% of works cited the
collections of essays and theses, and the most frequently
cited resources (23.4% of works) were http://ru.wikipedia.org
and http://habrahabr.ru.
At least one equivalent fragment was identified in 84% (79 of
94) of works. The size of the equivalent fragments in the
analyzed group exceeded the average (5660 characters) in
36.17% of cases; these works contained 104 references in their
bibliographies, i.e. 34.78% of the total number of references
to Internet resources.
Among those 34 documents, 35.29% (12 works) contained
references to Internet resources, and the most frequently
cited resources were http://ru.wikipedia.org (17.64%) and
http://knowledge.allbest.ru and http://habrahabr.ru (8.82%
each). Equivalent fragments occurred most frequently (in
88.23% of works) in the collections of essays and theses,
including http://bibliofond.ru (47.05% of cases),
http://knowledge.allbest.ru (29.41%) and http://bestreferat.ru
(11.76%).
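The percentages reported for the group of 34 works can be reproduced from whole-number counts, as the sketch below shows. Only the figures 94, 34, 12 and 104 appear in the text. The remaining counts, and the total of 299 references to Internet resources (3.18 per document over 94 documents), are assumptions inferred from the reported percentages, which appear to be cut to two decimal places rather than rounded.

```python
def truncated_percent(count, total):
    """Percentage of count in total, cut (not rounded) to two decimals."""
    return (count * 10000 // total) / 100


ALL_WORKS = 94
SUBGROUP = 34  # works whose equivalent fragments exceed the average size

# Counts other than 12 are assumptions inferred from the percentages.
counts = {
    "works citing Internet resources": 12,
    "cite ru.wikipedia.org": 6,
    "cite knowledge.allbest.ru": 3,
    "fragments from essay and thesis collections": 30,
    "fragments from bibliofond.ru": 16,
    "fragments from knowledge.allbest.ru": 10,
    "fragments from bestreferat.ru": 4,
}
shares = {label: truncated_percent(n, SUBGROUP) for label, n in counts.items()}

# Share of the subgroup among all works, and of its 104 references among
# an assumed total of 299 references to Internet resources.
subgroup_share = truncated_percent(SUBGROUP, ALL_WORKS)
reference_share = truncated_percent(104, 299)

for label, value in shares.items():
    print(f"{label}: {value}%")
```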
4. Conclusion.
The results presented above suggest that the developed system
can be applied to improve the quality control of the
educational process at the stage of university students’
graduate works.
5. Bibliography And Authors
[1] Krapivin Y.: Automatic identification of the semantically
equivalent fragments of the text documents, proceedings of
OWD’2012, 20-23 October 2012, Wisla
[2] Krapivin Y.: Plagiarism identification in multilingual
information environment, proceedings of OWD’2011, 22-25
October 2011, Wisla
[3] http://www.google.com/
[4] Robertson S.: Understanding Inverse Document Frequency: On
Theoretical Arguments for IDF // Journal of Documentation. –
2004. – № 60 (5). – P.503-520
Authors:
Mr. Yury Krapivin
Brest State Technical University
Moskovskaja str., 267
224017 Brest, Belarus
tel. +375 297 98 81 46
fax +375 162 42 21 27
email: ybox@list.ru