Impact of automated translation on mining knowledge from text data

Conference of Doctoral Students PEFNet, November 19, 2015,
Brno, Czech Republic
Luděk Svozil¹
¹Department of Informatics, Faculty of Business and Economics, Mendel University in Brno,
Zemědělská 1, 613 00 Brno, Czech Republic, e-mail: xsvozil@mendelu.cz
Abstract
Keywords: first, second, last
1. Introduction
Text mining, the discipline concerned with extracting useful knowledge from large amounts
of text data, has gained considerable attention as the volume of text data available from
many sources has grown. These data may come from many different countries and be written
in several languages.
As the volume of available text data grows, so does the ability of machine translators
to convert documents written in various languages into English, the language for which
the most advanced tools are available.
2. Methodology and Data
As test data, I used hotel reviews written by visitors in different languages. I chose
English, Spanish, German, French, Polish and Czech. Each review is labeled as positive
or negative, which makes it possible to compare and measure the impact of different
pre-processing methods, such as translation or stemming, on classification.
2.1. Data preparation
After the specific adjustments applied to each group (discussed below in Section 2.3),
some common work needs to be done on the data to prepare them for classification, which
is used to determine the impact of these specific adjustments.
To begin with, unwanted characters such as punctuation or digits are removed. Words
with a low global frequency (lower than 4) are also removed, because they carry little
classification significance and removing them speeds up the process. For the same reason,
words shorter than 2 characters are removed.
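The clean-up steps above could be sketched as follows. This is a minimal illustration,
not the code used in the experiment: the function names are hypothetical, lowercasing is
an added assumption, and "global frequency" is assumed to mean the total number of
occurrences of a word across all documents.

```python
import re
from collections import Counter

MIN_GLOBAL_FREQ = 4   # words occurring fewer than 4 times overall are dropped
MIN_WORD_LENGTH = 2   # words shorter than 2 characters are dropped


def tokenize(text):
    """Lowercase (assumption), turn punctuation and digits into spaces, split."""
    # [\W\d_] matches every character that is not a letter; in Python 3 the
    # match is Unicode-aware, so letters with diacritics are kept.
    cleaned = re.sub(r"[\W\d_]+", " ", text.lower())
    return cleaned.split()


def prepare(documents):
    """Return one token list per document, without short or globally rare words."""
    tokenized = [tokenize(doc) for doc in documents]
    # Assumption: global frequency = total occurrences across all documents.
    global_freq = Counter(tok for doc in tokenized for tok in doc)
    return [
        [t for t in doc
         if len(t) >= MIN_WORD_LENGTH and global_freq[t] >= MIN_GLOBAL_FREQ]
        for doc in tokenized
    ]
```

Calling `prepare(reviews)` on a list of review strings returns the filtered token lists
that the vector representation is then built from.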
Then each document (review) is converted to a vector representation. The set of
documents then forms a term matrix. In the term matrix, each word present in the set of
documents is represented as a number in its own column. The number is a weight
consisting of a local and a global part. The weighting scheme I chose for this experiment
is TF-IDF because of its versatility.
2.1.1. TF-IDF
todo
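A minimal sketch of one common TF-IDF formulation follows. The source does not state
which variant was used, so the choices here are assumptions: the local part is the raw
term frequency in the document, the global part is log(N / df), where N is the number of
documents and df the number of documents containing the word. The function name
tf_idf_matrix is hypothetical.

```python
import math
from collections import Counter


def tf_idf_matrix(tokenized_docs):
    """Build a term matrix: one row per document, one column per word."""
    vocabulary = sorted({t for doc in tokenized_docs for t in doc})
    n_docs = len(tokenized_docs)
    # Global part (assumed variant): idf(t) = log(N / df(t)),
    # where df(t) is the number of documents that contain t.
    doc_freq = Counter(t for doc in tokenized_docs for t in set(doc))
    idf = {t: math.log(n_docs / doc_freq[t]) for t in vocabulary}
    matrix = []
    for doc in tokenized_docs:
        counts = Counter(doc)  # local part (assumed variant): raw term frequency
        matrix.append([counts[t] * idf[t] for t in vocabulary])
    return vocabulary, matrix
```

With this variant, a word that occurs in every document receives a weight of zero, since
log(N / N) = 0.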
2.2. Classification algorithms used
The main tool I use for comparison and evaluation is the C5.0 decision tree. Unlike other
machine learning algorithms, such as support vector machines or neural networks, a
decision tree can be interpreted by a human. The words that affect the classification
most can be found near the top of the generated decision trees. In addition, a list of
the attributes (words) ordered by their influence is produced. These lists can then be
compared among languages.
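To illustrate how words can be ranked by their influence on classification, the sketch
below scores each word by information gain, an entropy-based measure of the kind that
decision trees in the C4.5/C5.0 family build on. It is not C5.0 and does not reproduce
its attribute ranking; the function names and the presence/absence split are assumptions
made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(docs, labels, word):
    """Entropy reduction obtained by splitting documents on presence of `word`."""
    with_word = [y for doc, y in zip(docs, labels) if word in doc]
    without = [y for doc, y in zip(docs, labels) if word not in doc]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (with_word, without) if part)
    return entropy(labels) - remainder


def rank_words(docs, labels):
    """Return (gain, word) pairs, highest gain first, ties broken alphabetically."""
    vocabulary = {t for doc in docs for t in doc}
    scored = [(information_gain(docs, labels, w), w) for w in vocabulary]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))
```

Ranked lists produced this way for each language could be compared in the same manner as
the attribute lists reported by the decision tree.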
2.3. Preparation of individual test groups
In each language, four groups are created using different pre-processing steps so that
they can be compared with each other.
All groups are then further processed and converted to vector representation as
explained in Section 2.1.
Finally, the C5.0 algorithm is run with 3-fold cross-validation on each group, and
outputs such as the classification error or the most significant words are compared and
evaluated.
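The evaluation loop can be sketched as below. This is an illustration under stated
assumptions: the folds are formed by simple interleaving without shuffling or
stratification, the `train` callback interface is hypothetical, and `train_majority` is
a stand-in classifier used only to keep the example self-contained; it is not C5.0.

```python
from collections import Counter


def k_fold_indices(n_items, k=3):
    """Split indices 0..n_items-1 into k interleaved folds."""
    return [list(range(fold, n_items, k)) for fold in range(k)]


def cross_validated_error(rows, labels, train, k=3):
    """Average classification error over k folds.

    `train(rows, labels)` must return a function that maps one row to a label.
    """
    errors = []
    for test_idx in k_fold_indices(len(rows), k):
        held_out = set(test_idx)
        train_rows = [r for i, r in enumerate(rows) if i not in held_out]
        train_labels = [y for i, y in enumerate(labels) if i not in held_out]
        predict = train(train_rows, train_labels)
        wrong = sum(predict(rows[i]) != labels[i] for i in test_idx)
        errors.append(wrong / len(test_idx))
    return sum(errors) / k


def train_majority(rows, labels):
    """Stand-in classifier (not C5.0): always predicts the most common label."""
    majority = Counter(labels).most_common(1)[0][0]
    return lambda row: majority
```

Replacing `train_majority` with a function that fits an actual decision tree on the term
matrix rows would yield the per-group classification error.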
2.3.1. Original (control) group
20,000 labeled text documents in one language are converted to vector representation
without any additional pre-processing.
2.3.2. Translated group
The same documents as in the control group are first translated to English one by one
using google.translate.com.
2.3.3. Stemmed group
Stemming is applied to the documents from the control group.
2.3.4. Translated and stemmed
English stemming is applied to the documents from the translated group.
3. Results
Table 1: C5.0 classification error for each group and language
Group                     ES       FR       PL       CS       DE
Original                  14.10%   14.10%   12.40%   14.60%   12.70%
Translated                14.10%   13.30%   11.30%   12.70%   12.00%
Stemmed                   15.30%   14.00%   11.90%   11.80%   13.50%
Translated and stemmed    15.50%   15.50%   12.80%   13.70%   14.10%
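The values in Table 1 can be compared with a few lines of arithmetic. The sketch below
copies the classification errors from the table and computes the mean error of a group
and its per-language difference from the Original group; the helper names are
hypothetical and no statistical test is implied.

```python
# Classification error (%) from Table 1, columns ordered ES, FR, PL, CS, DE
LANGUAGES = ["ES", "FR", "PL", "CS", "DE"]
ERRORS = {
    "Original": [14.1, 14.1, 12.4, 14.6, 12.7],
    "Translated": [14.1, 13.3, 11.3, 12.7, 12.0],
    "Stemmed": [15.3, 14.0, 11.9, 11.8, 13.5],
    "Translated and stemmed": [15.5, 15.5, 12.8, 13.7, 14.1],
}


def mean_error(group):
    """Mean classification error of one group across the five languages."""
    values = ERRORS[group]
    return sum(values) / len(values)


def difference_from_original(group):
    """Per-language change in error, in percentage points, versus Original."""
    return {
        lang: round(err - base, 2)
        for lang, err, base in zip(LANGUAGES, ERRORS[group], ERRORS["Original"])
    }
```

A negative difference means the group was classified with a lower error than the
untouched documents in that language.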
Figure 1: C5.0 classification error by group (Original, Translated, Original stemmed,
Translated stemmed) and language (es, fr, pl, cs, de); the error axis runs from 0.00% to
18.00%.
4. Discussion and Conclusions
Acknowledgements
This work was supported by the research grant IGA of Mendel University in Brno No.
16/2015.