Opinion Mining and Topic Categorization with Novel Term Weighting

Roman Sergienko, Ph.D. student, Ulm University, Germany
Tatiana Gasanova, Ph.D. student, Ulm University, Germany
Shakhnaz Akhmedova, Ph.D. student, Siberian State Aerospace University, Krasnoyarsk, Russia

Contents
- Motivation
- Databases
- Text preprocessing methods
- The novel term weighting method
- Feature selection
- Classification algorithms
- Results of numerical experiments
- Conclusions

Motivation
The goal of this work is to evaluate how competitive the novel term weighting method is against standard techniques for opinion mining and topic categorization. The evaluation criteria are:
1) Macro F-measure on the test set
2) Computational time

Databases: DEFT'07 and DEFT'08

DEFT'07 (opinion mining):
Corpus  | Train size | Test size | Vocabulary | Classes
Books   | 2074       | 1386      | 52507      | 0: negative, 1: neutral, 2: positive
Games   | 2537       | 1694      | 63144      | 0: negative, 1: neutral, 2: positive
Debates | 17299      | 11533     | 59615      | 0: against, 1: for

DEFT'08 (topic categorization):
Corpus | Train size | Test size | Vocabulary | Classes
T1     | 15223      | 10596     | 202979     | 0: Sport, 1: Economy, 2: Art, 3: Television
T2     | 23550      | 15693     | 262400     | 0: France, 1: International, 2: Literature, 3: Science, 4: Society

The existing text preprocessing methods
- Binary preprocessing
- TF-IDF (Salton and Buckley, 1988)
- Confidence Weights, ConfWeight (Soucy and Mineau, 2005)

The novel term weighting method
Notation:
- L: the number of classes;
- n_i: the number of instances of the i-th class;
- N_ji: the number of occurrences of the j-th word over all instances of the i-th class;
- T_ji = N_ji / n_i: the relative frequency of the j-th word in the i-th class;
- R_j = max_i T_ji and S_j = argmax_i T_ji: the class assigned to the j-th word.

The weight of the j-th word is then

C_j = (1 / Σ_{i=1..L} T_ji) · (R_j − (1 / (L − 1)) · Σ_{i=1..L, i≠S_j} T_ji)

(A reference implementation is sketched in the appendix below.)

Feature selection
1) Calculate the relative frequency of each word in each class.
2) For each word, choose the class with the maximum relative frequency.
3) For each utterance to be classified, calculate the sum of the weights of the words assigned to each class.
4) The number of attributes therefore equals the number of classes.
(See the second appendix sketch below.)

Classification algorithms
- k-nearest neighbors with distance weighting (k varied from 1 to 15; see the appendix);
- kernel Bayes classifier with Laplace correction;
- neural network with error back propagation (standard settings in RapidMiner);
- Rocchio classifier with different metrics and values of the γ parameter;
- support vector machine (SVM) generated and optimized with Co-Operation of Biology Related Algorithms, COBRA (Akhmedova and Semenkin, 2013).

Computational effectiveness
[Charts: computational time of the preprocessing methods on DEFT'07 and DEFT'08.]

The best values of F-measure
Problem | F-measure | Best known value | Term weighting method | Classification algorithm
Books   | 0.619     | 0.603            | the novel TW          | SVM
Games   | 0.720     | 0.784            | ConfWeight            | k-NN
Debates | 0.714     | 0.720            | ConfWeight            | SVM
T1      | 0.856     | 0.894            | the novel TW          | SVM
T2      | 0.851     | 0.880            | the novel TW          | SVM

Comparison of ConfWeight and the novel term weighting (F-measure)
Problem | ConfWeight | The novel TW | Difference
Books   | 0.588      | 0.619        | +0.031
Games   | 0.720      | 0.712        | −0.008
Debates | 0.714      | 0.700        | −0.014
T1      | 0.855      | 0.856        | +0.001
T2      | 0.820      | 0.851        | +0.031

Conclusions
The novel term weighting method gives classification quality similar to or better than ConfWeight, while requiring only as much computation time as TF-IDF.
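Appendix: reference sketches (Python)

A minimal sketch of the novel term weighting described above, assuming tokenized documents and integer class labels in 0..L−1; the function and variable names are hypothetical, not taken from the slides.

```python
from collections import Counter, defaultdict

def novel_term_weights(docs, labels, num_classes):
    """Novel term weighting: C_j = (R_j - mean of T_ji over i != S_j) / sum_i T_ji.

    docs        -- list of tokenized documents (lists of words)
    labels      -- class index in 0..L-1 for each document
    num_classes -- L, the number of classes (assumed >= 2, each present in docs)
    """
    L = num_classes
    n = Counter(labels)                       # n_i: number of instances of class i
    N = defaultdict(lambda: [0] * L)          # N_ji: occurrences of word j in class i
    for words, c in zip(docs, labels):
        for w in words:
            N[w][c] += 1

    weights, word_class = {}, {}
    for w, counts in N.items():
        T = [counts[i] / n[i] for i in range(L)]   # T_ji = N_ji / n_i
        R = max(T)                                  # R_j: maximum relative frequency
        S = T.index(R)                              # S_j: class assigned to word j
        others = (sum(T) - R) / (L - 1)             # (1/(L-1)) * sum of T_ji, i != S_j
        weights[w] = (R - others) / sum(T)          # C_j
        word_class[w] = S
    return weights, word_class
```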
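The four-step feature selection then maps every utterance to L attributes: the per-class sums of its words' weights. Again a sketch under the same assumptions; the argmax prediction in the usage comment is only a trivial baseline, not the classifier used in the experiments.

```python
def class_sum_features(words, weights, word_class, num_classes):
    """Map an utterance to num_classes features: for each class, the sum of
    the C_j weights of the utterance's words assigned to that class."""
    features = [0.0] * num_classes
    for w in words:
        if w in weights:                      # out-of-vocabulary words are skipped
            features[word_class[w]] += weights[w]
    return features

# Usage: these low-dimensional vectors feed any of the classifiers listed above;
# the simplest baseline predicts the class with the largest summed weight:
#   predicted = max(range(num_classes), key=lambda c: features[c])
```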
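Finally, one plausible reading of "k-nearest neighbors with distance weighting" over these feature vectors; inverse-distance voting is an assumption here, since the slides do not specify the exact weighting scheme.

```python
import numpy as np

def knn_predict(train_X, train_y, x, k=5, eps=1e-9):
    """Distance-weighted k-NN: each of the k nearest training vectors votes
    for its class with weight 1 / (Euclidean distance + eps)."""
    dists = np.linalg.norm(np.asarray(train_X, float) - np.asarray(x, float), axis=1)
    votes = {}
    for i in np.argsort(dists)[:k]:
        votes[train_y[i]] = votes.get(train_y[i], 0.0) + 1.0 / (dists[i] + eps)
    return max(votes, key=votes.get)
```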