Opinion Mining and Topic Categorization with Novel Term Weighting

Roman Sergienko, Ph.D. student
Tatiana Gasanova, Ph.D. student
Ulm University, Germany
Shaknaz Akhmedova, Ph.D. student
Siberian State Aerospace University, Krasnoyarsk, Russia
Contents
Motivation
Databases
Text preprocessing methods
The novel term weighting method
Feature selection
Classification algorithms
Results of numerical experiments
Conclusions
Motivation
The goal of this work is to evaluate how competitive the novel term
weighting method is compared with standard techniques for opinion
mining and topic categorization.
The criteria are:
1) Macro F-measure for the test set
2) Computational time
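The first criterion can be illustrated with a short sketch. It assumes the common definition of the macro F-measure as the unweighted mean of per-class F1 scores; the slides do not say which variant was used, and another common variant computes F from the macro-averaged precision and recall instead. The function name `macro_f_measure` is hypothetical.

```python
def macro_f_measure(y_true, y_pred):
    """Macro-averaged F1: unweighted mean of per-class F1 scores (assumed variant)."""
    classes = sorted(set(y_true) | set(y_pred))
    f_scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f_scores.append(2 * precision * recall / (precision + recall))
        else:
            f_scores.append(0.0)
    return sum(f_scores) / len(f_scores)


# Toy labels, not taken from the DEFT corpora.
score = macro_f_measure([0, 0, 1, 1], [0, 1, 1, 1])
```

Because every class contributes equally, a small class influences the score as much as a large one.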
Databases: DEFT’07 and DEFT’08

DEFT’07
Corpus    Train size   Test size   Vocabulary   Classes
Books     2074         1386        52507        0: negative, 1: neutral, 2: positive
Games     2537         1694        63144        0: negative, 1: neutral, 2: positive
Debates   17299        11533       59615        0: against, 1: for

DEFT’08
Corpus    Train size   Test size   Vocabulary   Classes
T1        15223        10596       202979       0: Sport, 1: Economy, 2: Art, 3: Television
T2        23550        15693       262400       0: France, 1: International, 2: Literature, 3: Science, 4: Society
The existing text preprocessing methods
• Binary preprocessing
• TF-IDF (Salton and Buckley, 1988)
• Confident Weights (Soucy and Mineau, 2005)
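The first two methods can be sketched in a few lines. This is a minimal illustration under stated assumptions: TF-IDF has many variants, and the one below (raw term frequency times log(N / document frequency)) is only one of them, not necessarily the one used in this work. ConfWeight is omitted because its formulas are not given here. The function names and toy documents are hypothetical.

```python
import math
from collections import Counter


def binary_weights(doc_tokens, vocabulary):
    """1.0 if the term occurs in the document, 0.0 otherwise."""
    present = set(doc_tokens)
    return [1.0 if term in present else 0.0 for term in vocabulary]


def tfidf_weights(docs, vocabulary):
    """Raw term frequency times log(N / document frequency) (assumed variant)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([
            tf[t] * math.log(n_docs / df[t]) if df[t] else 0.0
            for t in vocabulary
        ])
    return vectors


# Toy documents, not taken from the DEFT corpora.
docs = [["good", "game"], ["bad", "game"]]
vocabulary = ["good", "bad", "game"]
binary = binary_weights(docs[0], vocabulary)
tfidf = tfidf_weights(docs, vocabulary)
```

A term that occurs in every document, such as "game" above, receives a TF-IDF weight of zero because log(N / df) is zero.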
The novel term weighting method
L – the number of classes;
n_i – the number of instances of the i-th class;
N_ji – the number of occurrences of the j-th word in all instances of
the i-th class;
T_ji = N_ji / n_i – the relative frequency of the j-th word in the
i-th class;
R_j = max_i T_ji – the highest relative frequency of the j-th word
over all classes;
S_j = arg max_i T_ji – the index of the class assigned to the j-th word.
$$
C_j = \frac{1}{\sum_{i=1}^{L} T_{ji}} \left( R_j - \frac{1}{L-1} \sum_{\substack{i=1 \\ i \neq S_j}}^{L} T_{ji} \right)
$$
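The weighting above can be sketched in Python as follows. This is an illustrative reading of the definitions on this slide, not the authors' implementation; the function name `novel_term_weights`, the tokenized input format, and the toy data are assumptions. Class labels are assumed to be integers from 0 to L-1, with L at least 2.

```python
from collections import Counter


def novel_term_weights(docs, labels, num_classes):
    """Return {word: (C_j, S_j)} for tokenized documents with integer labels."""
    class_sizes = Counter(labels)  # n_i: number of instances per class
    counts = {}                    # word -> [N_j0, ..., N_j(L-1)]
    for tokens, label in zip(docs, labels):
        for word in tokens:
            counts.setdefault(word, [0] * num_classes)[label] += 1

    weights = {}
    for word, n_j in counts.items():
        # T_ji = N_ji / n_i (0 for a class with no training instances)
        t_j = [n_j[i] / class_sizes[i] if class_sizes[i] else 0.0
               for i in range(num_classes)]
        s_j = max(range(num_classes), key=lambda i: t_j[i])  # S_j = arg max_i T_ji
        r_j = t_j[s_j]                                       # R_j = max_i T_ji
        others = sum(t for i, t in enumerate(t_j) if i != s_j)
        c_j = (r_j - others / (num_classes - 1)) / sum(t_j)
        weights[word] = (c_j, s_j)
    return weights


# Toy data, not taken from the DEFT corpora: class 0 = negative, 1 = positive.
docs = [["good", "fun"], ["good"], ["bad", "fun"], ["bad"]]
labels = [1, 1, 0, 0]
weights = novel_term_weights(docs, labels, num_classes=2)
```

In this toy example "good" occurs only in class 1 and "bad" only in class 0, so both get the maximum weight of 1. "fun" is equally frequent in both classes, so its weight is 0.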
Feature selection
1) Calculate the relative frequency of each word in each class.
2) Assign each word to the class in which its relative frequency
is highest.
3) For each utterance to be classified, sum the weights of its
words separately for each class the words are assigned to.
4) Number of attributes = number of classes
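Step 3 can be sketched as below, assuming the term weights and class assignments are already available as a dictionary mapping each word to a (weight, class index) pair. The function name `class_sum_features`, the dictionary layout, and the example weights are hypothetical. The sketch counts every occurrence of a word and skips words unseen in training, which the slides do not specify.

```python
def class_sum_features(tokens, term_weights, num_classes):
    """Map a tokenized utterance to one attribute per class.

    term_weights: {word: (weight, assigned_class_index)} (assumed layout).
    """
    features = [0.0] * num_classes
    for word in tokens:
        if word in term_weights:  # words unseen in training are skipped
            weight, assigned_class = term_weights[word]
            features[assigned_class] += weight
    return features


# Hypothetical weights for a two-class problem (0 = negative, 1 = positive).
term_weights = {"good": (1.0, 1), "bad": (0.8, 0), "fun": (0.2, 1)}
features = class_sum_features(["good", "fun", "unknown"], term_weights, 2)
```

The resulting vector has as many attributes as there are classes, so its length does not depend on the vocabulary size.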
Classification algorithms
• k-nearest neighbors algorithm with distance weighting (we varied
k from 1 to 15);
• kernel Bayes classifier with Laplace correction;
• neural network with error back propagation (standard setting in
RapidMiner);
• Rocchio classifier with different metrics and γ parameter;
• support vector machine (SVM) generated and optimized with
Co-Operation of Biology Related Algorithms (COBRA) (Akhmedova
and Semenkin, 2013).
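As an illustration of the first algorithm in the list, here is a minimal distance-weighted k-NN sketch. The inverse-distance vote and the Euclidean metric are assumptions; the slides do not state the exact weighting scheme or metric used in the experiments. The function name `knn_predict` and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def knn_predict(train_x, train_y, query, k=3):
    """Predict a label by inverse-distance-weighted vote of the k nearest points."""
    # Sort training points by Euclidean distance to the query.
    neighbours = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    votes = defaultdict(float)
    for distance, label in neighbours[:k]:
        votes[label] += 1.0 / (distance + 1e-9)  # closer neighbours count more
    return max(votes, key=votes.get)


# Toy two-attribute feature vectors, one attribute per class.
train_x = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
train_y = [0, 0, 1, 1]
near_class_0 = knn_predict(train_x, train_y, [0.2, 0.1], k=3)
near_class_1 = knn_predict(train_x, train_y, [5.5, 5.0], k=3)
```

Varying k, as done in the experiments, only requires calling the function with different values of `k` and comparing the resulting scores.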
Computational effectiveness
[Figure panels: DEFT’07, DEFT’08]
The best values of F-measure
Problem   F-measure   Best known value   Term weighting method   Classification algorithm
Books     0.619       0.603              The novel TW            SVM
Games     0.720       0.784              ConfWeight              k-NN
Debates   0.714       0.720              ConfWeight              SVM
T1        0.856       0.894              The novel TW            SVM
T2        0.851       0.880              The novel TW            SVM
Comparison of ConfWeight and the novel term weighting
Problem   ConfWeight   The novel TW   Difference
Books     0.588        0.619          +0.031
Games     0.720        0.712          -0.008
Debates   0.714        0.700          -0.014
T1        0.855        0.856          +0.001
T2        0.820        0.851          +0.031
Conclusions
• The novel term weighting method gives classification quality
similar to or better than the ConfWeight method, while requiring
the same amount of computational time as TF-IDF.