Conference of Doctoral Students PEFNet, November 19, 2015, Brno, Czech Republic

Impact of automated translation on mining knowledge from text data

Luděk Svozil¹

¹Department of Informatics, Faculty of Business and Economics, Mendel University in Brno, Zemědělská 1, 613 00 Brno, Czech Republic, e-mail: [email protected]

Abstract

Keywords: first, second, last

1. Introduction

Text mining, the discipline concerned with mining useful knowledge from large amounts of text data, has gained great attention as the volume of text data available from many sources has grown. These data may come from many different countries and be written in several languages. Alongside this growth, machine translators have become increasingly able to convert multi-language documents to English, the language for which the most advanced tools are available.

2. Methodology and Data

As test data I used hotel reviews written by visitors in different languages. I chose English, Spanish, German, French, Polish and Czech. Each review is labeled as positive or negative, which makes it possible to compare and measure the impact of different pre-processing methods, such as translation or stemming, on classification.

2.1. Data preparation

After the group-specific adjustments (discussed below in chapter 2.3), some common processing is applied to the data of every group to prepare them for classification, which is used to determine the impact of these specific adjustments. First, unwanted characters such as punctuation or digits are removed. Words with low global frequency (lower than 4) are also removed, because they carry little classification significance and removing them speeds up the process. For the same reason, words shorter than 2 characters are removed. Then each document (review) is converted to a vector representation. The set of documents then forms a term matrix.
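The preparation steps described above can be illustrated with a minimal Python sketch. It is not the code used in the experiment. The function names `tokenize` and `build_term_matrix` are hypothetical, lowercasing is an added assumption, "global frequency" is assumed to mean the total number of occurrences in the whole set of documents, and the matrix holds raw term counts rather than weighted values.

```python
import re
from collections import Counter

MIN_GLOBAL_FREQ = 4  # words occurring fewer than 4 times overall are dropped
MIN_WORD_LENGTH = 2  # words shorter than 2 characters are dropped


def tokenize(text):
    """Split a review into words, discarding punctuation and digits."""
    # [^\W\d_]+ matches runs of word characters that are neither digits
    # nor underscores, so accented letters are kept.
    # Lowercasing is an assumption, not a step stated in the paper.
    words = re.findall(r"[^\W\d_]+", text.lower())
    return [w for w in words if len(w) >= MIN_WORD_LENGTH]


def build_term_matrix(documents):
    """Return (vocabulary, matrix) with one row per document.

    Each column corresponds to one word of the vocabulary and holds
    the raw count of that word in the document.
    """
    tokenized = [tokenize(doc) for doc in documents]

    # Global frequency: total occurrences across all documents (assumed).
    global_freq = Counter(w for doc in tokenized for w in doc)
    vocabulary = sorted(
        w for w, n in global_freq.items() if n >= MIN_GLOBAL_FREQ
    )
    column = {w: i for i, w in enumerate(vocabulary)}

    matrix = []
    for doc in tokenized:
        row = [0] * len(vocabulary)
        for w in doc:
            if w in column:  # words below the frequency threshold are skipped
                row[column[w]] += 1
        matrix.append(row)
    return vocabulary, matrix
```

Calling `build_term_matrix(reviews)` on a list of review strings returns the sorted vocabulary and a list of count vectors, one per review; a weighting scheme can then be applied to these counts.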
In the term matrix, each word present in the set of documents is represented by a number in its own column. The number is a weight consisting of a local and a global part. The scheme I chose for this experiment is TF-IDF because of its versatility.

2.1.1. TF-IDF

todo

2.2. Classification algorithms used

The main tool I use for comparison and evaluation is the C5.0 decision tree. Unlike other machine learning algorithms such as support vector machines or neural networks, a decision tree can be interpreted by a human. The words that affect the classification most are found near the tops of the resulting decision trees. A list of the attributes (words) ranked by their influence is also produced. These lists can then be compared among languages.

2.3. Actual preparation of single test groups

In each language, four groups are made using different pre-processing steps so that they can be compared to each other. All groups are then further processed and converted to vector representation as explained above. Finally, C5.0 with 3-fold cross-validation is run on each group, and outputs such as the classification error or the most significant words are compared and evaluated.

2.3.1. Original (control) group

20 000 labeled text documents in one language are converted directly to vector representation.

2.3.2. Translated group

The same documents as in the control group are first translated to English one by one using google.translate.com.

2.3.3. Stemmed group

Stemming is applied to the documents from the control group.

2.3.4. Translated and stemmed

English stemming is applied to the documents from the Translated group.

3. Results

Table 1: C5.0 classification error for each group and language

                          ES       FR       PL       CS       DE
Original                  14.10%   14.10%   12.40%   14.60%   12.70%
Translated                14.10%   13.30%   11.30%   12.70%   12.00%
Stemmed                   15.30%   14.00%   11.90%   11.80%   13.50%
Translated and stemmed    15.50%   15.50%   12.80%   13.70%   14.10%

Figure 1: C5.0 classification error [chart of the classification error for the groups Original, Translated, Original stemmed and Translated stemmed in the languages es, fr, pl, cs and de; error axis from 0.00% to 18.00%]

4. Discussion and Conclusions

Acknowledgements

This work was supported by the research grant IGA of Mendel University in Brno No. 16/2015.