On the identification of unknown authors: a comparison between SVMs and non parametric methods
Paola Cerchiello
Department of Economics, University of Pavia, paola.cerchiello@eco.unipv.it
Summary. In the text mining field, great relevance is given to the classification task. In the classical setting the SVM classifier shows good performance. However, if a new document belonging to an unknown author has to be classified, this method fails because of its inability to detect the text as an example of a new category. To overcome this problem, a new method based on the combined employment of decision trees and the Kruskal-Wallis test is proposed. The resulting methodology is evaluated by means of correspondence analysis.
Key words: text classification, SVMs, Kruskal-Wallis test, decision trees, correspondence analysis
1 Introduction
With the rapid growth of online information, text categorization has become one
of the key techniques for handling and organizing data in textual format. Text
categorization techniques are an essential part of text mining and are used to classify
news documents and to find interesting information on the many websites available online. Since
building text classifiers by hand is difficult, time-consuming and often not efficient,
it is advantageous to learn classifiers from examples.
Text categorization is the task of assigning a Boolean value to each pair

$$(d_j, c_i) \in D \times C \qquad (1)$$

where $D$ is a domain of documents and $C = \{c_1, \ldots, c_{|C|}\}$ is a set of predefined categories. A value of $T$ assigned to $(d_j, c_i)$ indicates a decision to file $d_j$ under $c_i$, and a value of $F$ a decision not to.
In this context, classical classification methods such as support vector machines [J98] and the naive Bayes classifier [KHZ00] have shown performances strictly connected to, and dependent on, their structural elements. However, if a new document belonging to an unknown author has to be classified, the above methods fail because of their inability to detect the text as an example of a new category. In order to show the main results of the proposed methodology, a variant of the standard Reuters database, containing 114 different authors and 468 words, has been employed.
2 Classical methods
According to the text classification literature, the best results in the field are obtained by means of support vector machines (SVMs). They are a family of supervised learning methods used for both classification and regression. When used for classification, the SVM algorithm constructs the hyperplane that separates the data into two classes with maximum margin. In the simplest binary case, given training examples labelled either "yes" or "no", a maximum-margin hyperplane is identified which splits the "yes" from the "no" training examples, such that the distance between the hyperplane and the closest examples (the margin) is maximized.
Accordingly, we applied support vector machines to a sample containing three different authors (numbers 5, 60, 110) in order to show their good performance. We then used the same fitted models on a new set containing some new documents belonging to the old authors and some others belonging to a new author, never classified before. Ideally, those classifiers should assign an indecisive score to the new author's documents, since they are not examples of any class. However, as exemplified in Table 1, the SVM assigns inclusion probabilities far from indecision, proportionally increasing the corresponding misclassification error.
Table 1. Inclusion probabilities derived by the SVM classifier for the unknown author's documents.

Auth 5   Auth 60   Auth 110
0.832    0.353     0.047
0.040    0.002     0.004
0.106    0.278     0.807
0.977    0.322     0.011
0.196    0.865     0.813
0.958    0.929     0.343
0.388    0.144     0.004
0.519    0.156     0.450
0.086    0.145     0.039
0.066    0.549     0.333
0.047    0.018     0.157
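To make the failure mode concrete, the following minimal sketch reproduces the shape of this experiment with scikit-learn; the word-count data are simulated and the exact SVM settings used in the paper are not reported, so everything here is an assumption for illustration.

```python
# Hedged sketch of the Section 2 experiment: fit an SVM on three known
# authors, then score documents by a fourth, unseen author.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Simulated word-count matrices (rows = documents, columns = 468 words);
# the real data come from the modified Reuters set described above.
X_train = rng.poisson(1.0, size=(360, 468)).astype(float)  # authors 5, 60, 110
y_train = np.repeat(["auth5", "auth60", "auth110"], 120)
X_new = rng.poisson(1.0, size=(4, 468)).astype(float)      # unknown author

clf = SVC(kernel="linear", probability=True).fit(X_train, y_train)

# The model can only spread probability over the known classes, so the
# unknown author's texts still receive confident-looking scores.
print(clf.predict_proba(X_new).round(3))
```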
As a consequence, SVMs cannot be considered useful for this kind of application, which justifies the development of a different methodology.
3 The theoretical background
The first step in text categorization is to transform documents, which typically are
strings of characters, into a representation suitable for the learning algorithm and
the classification task. Each distinct word $w_i$ corresponds to a feature, with the number of times $w_i$ occurs in the document as its value. After activities of
parsing and stemming, the number of relevant words is still very high (on the order of $10^3$); therefore feature selection methods must be applied. These aim to identify the words with the greatest discriminative power, i.e. those that best distinguish different authors. The literature has presented several approaches in this context, for example [GE03], [S03], [F03]. The methodology proposed here relies on the combined employment of two statistical tools rooted in the family of non parametric models: decision trees and the Kruskal-Wallis test.
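As a minimal illustration of the representation step described above, the sketch below builds the documents-by-words count matrix with scikit-learn; the two example sentences are invented, and stemming (which the paper applies) would require an additional tool such as a stemmer from NLTK.

```python
# Hedged sketch of the bag-of-words representation: each distinct word
# becomes a feature whose value is its count in the document.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the market rallied as oil prices fell",
    "oil prices fell again in early trading",
]
vec = CountVectorizer()       # tokenization only; no stemming here
X = vec.fit_transform(docs)   # documents x words sparse count matrix

print(vec.get_feature_names_out())
print(X.toarray())
```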
The Kruskal-Wallis test [C71] is the non parametric version of the ANOVA analysis and represents a simple generalization of the Wilcoxon test for two independent samples. On the basis of $K$ independent samples of sizes $n_1, \ldots, n_K$, in this context represented by the different words within the bag of words, a unique pooled sample is created by merging the original $K$ samples. The pooled observations are ordered from the smallest to the largest, and a rank is assigned to each one. Finally $\bar{R}_i$, the mean of the ranks of the observations in the $i$-th sample, is calculated. The statistic is:
$$KW = \frac{\dfrac{12}{N(N+1)} \sum_{i=1}^{K} n_i \left( \bar{R}_i - \dfrac{N+1}{2} \right)^2}{1 - \dfrac{\sum_{i=1}^{g} (t_i^3 - t_i)}{N^3 - N}} \qquad (2)$$
where $N = \sum_i n_i$, $g$ is the number of groups of tied observations and $t_i$ is the number of ties in the $i$-th group; the denominator correction is needed when there are tied observations (as is typical in text categorization applications). The null hypothesis, that all the $K$ word distributions are the same, is rejected if:

$$KW > \chi^2_{K-1} \qquad (3)$$
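In practice the screening can be run word by word with scipy, which implements the tie-corrected statistic of Eq. (2); the helper below is a sketch under the assumption that the data arrive as a documents-by-words count matrix with one author label per document.

```python
# Hedged sketch of the Kruskal-Wallis word screen: keep the words whose
# count distribution differs significantly across the K authors.
import numpy as np
from scipy.stats import kruskal

def kw_selected_words(X, authors, alpha=0.01):
    """X: documents x words count matrix; authors: label per document."""
    keep = []
    for j in range(X.shape[1]):
        samples = [X[authors == a, j] for a in np.unique(authors)]
        try:
            _, p = kruskal(*samples)
        except ValueError:   # word used identically everywhere: uninformative
            continue
        if p < alpha:        # reject H0: distributions differ across authors
            keep.append(j)
    return keep
```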
Classification or decision tree methods [BFOS84] are a good choice when the task
of the analysis is classification or prediction of outcomes and the goal is to generate
rules that can be easily understood and explained. In the text mining field, in particular, they are trees in which internal nodes are labelled by terms, branches departing from them are labelled by tests on the weight that the term has in the test document, and leaves represent categories. This kind of classifier categorizes a test document $d_j$ by recursively testing the weights that the terms labelling the internal nodes have in the vector of $d_j$, until a leaf node is reached; the label of this node is then assigned to $d_j$. Like the Kruskal-Wallis test, the decision tree belongs to the family of non parametric models, so we do not have to choose a distribution for the terms (words) present in a document, which is a non-trivial problem in this kind of application.
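The complementary word set can be extracted from a fitted tree; the sketch below uses scikit-learn and keeps the columns actually chosen as split variables, with the entropy impurity measure that Section 4 mentions (the other tree settings are assumptions, since the paper does not report them).

```python
# Hedged sketch of the tree-based word selection: the words that appear
# in the internal-node splits are the ones that best separate authors.
from sklearn.tree import DecisionTreeClassifier

def tree_selected_words(X, authors):
    tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
    tree.fit(X, authors)
    used = tree.tree_.feature          # split feature per node; -2 marks a leaf
    return sorted({f for f in used if f >= 0})
```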
4 Proposal
The methodology proposed here combines the main elements of both the models above. Owing to lack of space, we refer the reader to [CG06] for a complete and deeper exposition of the methodological aspects. In order to show the main results of the proposal, a variant of the standard Reuters database, containing 114 different authors and 468 words, has been employed. For each author we have a different number of documents, at least 120, and each variable, representing a word, is labelled by a progressive number. In figure 1, a small extract of the available data set is shown.
Fig. 1. Small extract of the analyzed data set.

First of all we randomly chose three different authors for the analysis; then the Kruskal-Wallis test was applied to the data set containing them, in order to select the words whose distribution profiles are most heterogeneous and most different between the selected authors, so that they can reveal a typical and personal style
of writing for every single author. According to the test procedure, each word is assigned a p-value, on the basis of which a decision on the test hypothesis is taken: if the computed p-value is below the established threshold, the alternative hypothesis is accepted for the corresponding variable, that is, the word distribution behaves differently across the populations (the authors). This part of the analysis is useful to eliminate words used in a constant or semi-constant way in the different documents composed by the selected authors. The words selected in this way are combined, as we will see later, with words chosen by the application of the decision tree. In fact, the decision tree is grown on the data set containing the three selected authors and, at every step, the recursive algorithm selects the words that best distinguish the authors from each other. After the application of the two models we have two different sets of words, one selected through the Kruskal-Wallis test and another one located through the decision tree. It is now useful to recall the objective of this analysis: first of all, dimensionality reduction, to retain just the words able to reveal the typical and personal style of writing of every single author. Thereby, in order to combine the words selected through the above methods, we finally consider just the words selected either by the Kruskal-Wallis test (using a p-value threshold of 0.01) or by the decision tree based on the entropy impurity measure. The table below reports those words:
Table 2. Labels of the words selected either by the test or by the decision tree.

Words selected: 5, 21, 435, 398, 183, 235, 14, 215
As a consequence we finally obtain a very small set of words, which will be used to represent graphically each single author's profile by means of correspondence analysis.
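Reusing the two helpers sketched above, the combination step amounts to a set union; this is a sketch of our reading of the selection rule (words flagged by either method), with `X` and `authors` standing for the assumed count matrix and label vector.

```python
# Hedged sketch of the combination step: the final vocabulary is the
# union of the Kruskal-Wallis screen and the tree's split variables.
kw_words = kw_selected_words(X, authors, alpha=0.01)
tree_words = tree_selected_words(X, authors)
final_words = sorted(set(kw_words) | set(tree_words))
print(final_words)   # a small label set, like the eight words of Table 2
```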
As we said at the beginning, once the key words have been located, the second objective is to detect documents composed by an unknown author, different from the pre-classified ones. In this context the graphic derived from the correspondence analysis can be helpful in locating those documents: in fact, texts belonging to a different author should be placed outside the neighborhood of the cloud of texts belonging to the known authors.
Fig. 2. Graphic derived from the correspondence analysis.
For the sake of simplicity, in this part of the analysis we created another data set containing only some documents of one of the three initial authors (namely number 5), previously used to single out the key words, and, as the unknown author, a few randomly chosen texts belonging to author number 44. Finally, we applied correspondence analysis to this new data set, keeping track of the unknown author's documents.
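For readers who want to reproduce this kind of plot, simple correspondence analysis can be computed directly from the SVD of the standardized residuals of the count table; the function below is a self-contained numpy sketch (no CA library assumed), under the assumption that every row and column of the table has a positive total.

```python
# Hedged sketch of correspondence analysis on a documents x words count
# table; the row and column principal coordinates can then be plotted
# together in the first two dimensions, as in Figure 2.
import numpy as np

def correspondence_analysis(N, n_dims=2):
    """N: documents x words count table with positive row/column sums."""
    P = N / N.sum()                                      # correspondence matrix
    r, c = P.sum(axis=1), P.sum(axis=0)                  # row / column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    rows = U[:, :n_dims] * sv[:n_dims] / np.sqrt(r)[:, None]     # documents
    cols = Vt.T[:, :n_dims] * sv[:n_dims] / np.sqrt(c)[:, None]  # words
    return rows, cols
```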
In figure 2 the blue points represent the words selected by the two non parametric methods shown before, and the red points symbolize the authors' documents, distinguished by their code numbers (5 and 44). As the reader can easily observe, 3 out of the 4 texts belonging to author 44 can be considered outliers because of their position with respect to the central cloud. Thereby we can conclude that this approach can be usefully employed, first of all to create a specific author's profile, and consequently to use the key words found as important elements in plotting a set of new texts and in locating which ones may belong to an unknown author.
5 Conclusion
We want to stress the key element of this contribution. As said before, classical classification methods are not able to identify and isolate documents belonging to authors never labelled before. As a consequence, during an automatic classification process the incoming documents will be filed under one of the previously located classes, without considering the possible presence of anomalous texts. As shown in section 2, one of the candidate classifiers, the SVM, fails in this context. Thereby we suggest, first of all, to reduce the dimensionality by means of the combined employment of two non parametric tools: the decision tree and the Kruskal-Wallis test. In the second place, correspondence analysis is considered a useful method to represent the documents and the words selected by feature selection, in order to depict the lexical profile of a specific author. In other words, we try to understand the most typical words used by an author, which constitute the key for the identification of new, unknown authors.
The author thanks Paolo Giudici for useful discussions and suggestions, and MIUR for funding within the project "Data mining methods for e-business applications".
References
[B73] Benzecri, J. (1973). L'analyse des données. Dunod, Paris.
[BFOS84] Breiman, L., Friedman, J.H., Olshen, R., and Stone, C.J. (1984). Classification and regression trees. Wadsworth, Belmont.
[CFG05] Cerchiello, P., Figini, S., and Giudici, P. (2005). Feature selection: a non parametric approach. Atti di convegno internazionale S.Co., Bressanone (a cura di Corrado Provasi), 293–298.
[CG06] Cerchiello, P., and Giudici, P. (2006). Statistical methods for classification of unknown authors. Technical report.
[C71] Conover, W.J. (1971). Practical nonparametric statistics. Wiley, New York.
[F03] Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3, 1289–1306.
[G03] Giudici, P. (2003). Applied data mining. Wiley.
[GE03] Guyon, I., and Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3(3), 1157–1182.
[J98] Joachims, T. (1998). Text categorization with support vector machines: learning with many relevant features. In Proceedings of ECML-98, 10th European Conference on Machine Learning (Chemnitz, Germany, 1998), 137–142.
[KHZ00] Kim, Y.H., Hahn, S.Y., and Zhang, B.T. (2000). Text filtering by boosting naive Bayes classifiers. In Proceedings of SIGIR-00, 23rd ACM International Conference on Research and Development in Information Retrieval (Athens, Greece, 2000), 168–175.
[L98] Lewis, D.D. (1998). Naive Bayes at forty: the independence assumption in information retrieval. In Proceedings of ECML-98, 10th European Conference on Machine Learning (Chemnitz, Germany, 1998), 4–15.
[MY01] Manevitz, L.M., and Yousef, M. (2001). One-class SVMs for document classification. Journal of Machine Learning Research, 2, 139–154.
[S02] Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1–47.
[S03] Stoppiglia, H., et al. (2003). Ranking a random feature for variable and feature selection. Journal of Machine Learning Research, 3, 1399–1414.
[Z97] Zani, S. (1997). Analisi dei dati statistici, volume 1: osservazioni in una e due dimensioni. Giuffrè, Milano.
[Z00] Zani, S. (2000). Analisi dei dati statistici, volume 2: osservazioni multidimensionali. Giuffrè, Milano.