The influence of stopword removal on text mining task results
Ing. Jiří Krupník, Department of Informatics, Faculty of Business and
Economics, Mendel University, jiri.krupnik@mendelu.cz
Abstract
This paper deals with the analysis of the influence of preprocessing textual documents on text mining task results. One of the possible ways of preprocessing textual documents is stopword removal. Results achieved by experiments with large collections of real-world documents written in natural languages are presented and discussed. The research was focused on the results of different kinds of clustering and classification algorithms.
Key Words
Stopword, classification, clustering, text mining, text documents, collections, hotel reviews, Reuters
Introduction
Text mining is a relatively new area of computer science research that tries to solve the crisis of information overload by combining techniques from data mining, machine learning, natural language processing, information retrieval, and knowledge management. The main tasks solved in the context of text mining include text categorization, prediction, clustering, information extraction, and sentiment analysis (Weiss et al., 2010; Feldman and Sanger, 2007).
In order to process data in this way, it is necessary to transform the data into a representation appropriate for the chosen algorithm, and some preprocessing operations may be applied. Documents written in natural language are characterized by a large number of attributes, i.e., the number of unique words (terms). It is obvious that some of them carry no useful information for the mining algorithm. These irrelevant words are so-called noise words, and a collection of such words is a stoplist. Keeping the stoplist words in processing usually decreases the precision and efficiency of subsequent tasks, while reducing these noisy features helps clustering and categorization to work more effectively (Sinka and Corne, 2003). These results are supported by many papers, for example by Yang and Pedersen (1997), Rachel et al. (2005), Makrehchi and Kamel (2008), Li et al. (2009), and Dave (2011).
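As a minimal sketch of the filtering step described above, the following Python fragment removes stoplist words from tokenized documents; the two reviews and the stoplist are hypothetical examples, not the collections used in this research.

# Minimal sketch of stoplist filtering; the documents and the stoplist are
# hypothetical examples, not the collections used in the experiments.
documents = [
    "the room was clean and the staff was friendly",
    "the breakfast was cold and the wifi did not work",
]
stoplist = {"the", "was", "and", "a", "of"}

def remove_stopwords(text, stoplist):
    """Return the tokens of a document with stoplist words filtered out."""
    return [token for token in text.lower().split() if token not in stoplist]

filtered = [remove_stopwords(doc, stoplist) for doc in documents]
print(filtered)
# [['room', 'clean', 'staff', 'friendly'],
#  ['breakfast', 'cold', 'wifi', 'did', 'not', 'work']]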
Goal
The research goal was contribute to development of analysis and mining knowledge from
textual data written in natural language by performing set of experiments and evaluating
obtained results.
For this purpose is necessary to create analysis of current approaches generating
stopwords and perform set of classification and clustering experiments. Obtained results of
experiments must be evaluated and discussed with focused on comparison of single stopwords
removal method. To achieve a goal of this work is needed to introduce the text-mining issues
and the approaches of preprocessing textual documents, which also include software systems
for execution this tasks and experiments.
Methods
The stopword extraction is based on automated generating methods defined in publications by these authors: Yang and Pedersen (1997), Uchyigit and Clark (2008). These methods are designed for labeled data, and the following were used in the experiments (a sketch of one of these scoring functions follows the list):
- Odds Ratio (ODDR)
- Information Gain (IG)
- F-measure Feature Ranking (FFR)
- The Chi Statistic (CHI)
- Mutual Information (MI)
- Ng-Goh-Low coefficient (NGL)
- Galavotti-Sebastiani-Simi coefficient (GSS)
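As an illustration of how such a measure can rank terms for a stoplist, the following Python sketch computes the chi-square (CHI) statistic from per-class document frequencies. The tiny labeled collection is hypothetical, and the other listed measures would replace only the scoring formula.

# Sketch of chi-square (CHI) term scoring on a tiny hypothetical labeled
# collection; terms distributed independently of the classes score low and
# become stoplist candidates.
docs = [
    (["room", "clean", "nice"], "pos"),
    (["staff", "friendly", "nice"], "pos"),
    (["room", "dirty", "noisy"], "neg"),
    (["breakfast", "cold", "noisy"], "neg"),
]

def chi_square(term, cls, docs):
    """Chi-square statistic of a term with respect to one class."""
    n = len(docs)
    a = sum(1 for toks, c in docs if term in toks and c == cls)       # term present, class cls
    b = sum(1 for toks, c in docs if term in toks and c != cls)       # term present, other class
    c_ = sum(1 for toks, c in docs if term not in toks and c == cls)  # term absent, class cls
    d = n - a - b - c_                                                # term absent, other class
    denom = (a + c_) * (b + d) * (a + b) * (c_ + d)
    return 0.0 if denom == 0 else n * (a * d - c_ * b) ** 2 / denom

vocab = {t for toks, _ in docs for t in toks}
classes = {c for _, c in docs}
scores = {t: max(chi_square(t, c, docs) for c in classes) for t in vocab}
stoplist = sorted(scores, key=scores.get)[:2]   # two lowest-scoring terms
print(stoplist)   # 'room' ranks lowest; it occurs equally in both classes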
The fundamental text mining tasks solved in this research are text classification and clustering. These tasks are tested with modern machine learning algorithms that form the basis of the data mining area.
The purpose of classification algorithms is to assign the correct properties (classes) to an instance (sample) based on a set of preclassified training documents; this is supervised machine learning (Joachims, 1998). The classification methods chosen for the experiments are C5.0 and SVM. Clustering is a process through which objects are grouped into so-called clusters. In the case of clustering, the problem is to group a given unlabeled collection into meaningful clusters without any prior information; this concept is called unsupervised machine learning (Karypis, 2003). The k-Means method was chosen for testing.
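To make the supervised versus unsupervised distinction concrete, the following Python sketch trains an SVM on labeled texts and runs k-means on the same texts without labels. The scikit-learn calls and the tiny sample are illustrative assumptions only; the research itself used SVM light, SPSS Modeler (C5.0), and Cluto.

# Illustrative sketch of supervised classification (SVM) versus unsupervised
# clustering (k-means); scikit-learn stands in for the tools actually used
# in the research (SVM light, SPSS Modeler, Cluto).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.cluster import KMeans

texts = ["clean room friendly staff", "nice quiet location",
         "dirty room rude staff", "cold breakfast noisy street"]
labels = ["pos", "pos", "neg", "neg"]          # known classes -> supervised task

X = TfidfVectorizer().fit_transform(texts)     # TF-IDF vector representation

classifier = SVC(kernel="poly", degree=3).fit(X, labels)   # learns from the labels
print(classifier.predict(X[:1]))

clusterer = KMeans(n_clusters=2, n_init=10).fit(X)         # ignores the labels
print(clusterer.labels_)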
Objectives and methodology
The purpose of the research is to find the role of stopwords in categorization and clustering tasks, more precisely, to discover the influence of removing stopwords from text documents on the results of the text mining tasks. The data background for the outputs and conclusions is formed by the results of experiments executed with large collections of textual documents. In general, achieving the goal requires performing the following steps:
- analysis of current approaches to creating a stoplist
- using our own implementation of the chosen methods for generating stopwords
- selection of textual data and their conversion to an appropriate format
- choosing software that allows realizing text classification and clustering
- design and execution of the sets of experiments
- evaluation of the effect of stopword removal on the task results
The first two steps are described in detail by the author in his previous paper (Krupník, 2013).
Source data
The text data used in the experiments contained opinions, in many languages, of several million customers who booked accommodation in many different hotels and countries via an on-line Internet service. The review texts have two parts, a negative and a positive experience with the hotel, both written in natural language (Žižka and Dařena, 2011). Three representatives of widely used languages were selected for the experiments, English, German, and Spanish, complemented by a fourth representative, Czech. The second text source used in the experiments is the standard Reuters-21578 collection, more precisely its subset R8 created by Cardoso-Cachopo (2007). The documents contain newspaper articles released by Reuters, in this case divided into eight classes.
Used software
First of all, the source text must be transformed into an internal representation, i.e., into an appropriate format. The TextMining program module was used for this purpose (Žižka and Dařena, 2010). The formats for storing the output text files were chosen with regard to the requirements of the software used: csv, sparse and rlabel, and dat. Our own script implementing the automated generation of stopwords was used for creating the stoplists.
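Since the TextMining module itself is not described here in detail, the following Python sketch only illustrates, under assumptions, what such an export can look like: a document-term matrix with TF-IDF weights written in a Cluto-style sparse format (a header with row, column, and non-zero counts, then one line of column/weight pairs per document). The file name and the exact format details are assumptions, not the module's actual output.

import math
from collections import Counter

# Hedged sketch of exporting TF-IDF vectors to a Cluto-style sparse file;
# the real TextMining module's output formats may differ in details.
docs = [["room", "clean", "staff"], ["room", "dirty", "noisy"]]

vocab = sorted({t for d in docs for t in d})
index = {t: i + 1 for i, t in enumerate(vocab)}        # sparse columns are 1-based
df = Counter(t for d in docs for t in set(d))          # document frequencies

rows = []
for d in docs:
    tf = Counter(d)
    weights = {index[t]: tf[t] * math.log(len(docs) / df[t]) for t in tf}
    rows.append({c: w for c, w in weights.items() if w > 0.0})

with open("reviews.sparse", "w") as out:
    out.write(f"{len(rows)} {len(vocab)} {sum(len(r) for r in rows)}\n")
    for row in rows:
        out.write(" ".join(f"{c} {w:.4f}" for c, w in sorted(row.items())) + "\n")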
Cluto (Clustering Toolkit) is a software package consisting of two individual programs (vcluster and scluster) intended for clustering datasets and for analyzing the characteristics of the resulting clusters, developed by Karypis (2003). This program was used for the experiments with the k-Means method.
SVM light is an implementation of the Support Vector Machine algorithm created by Joachims (1999); it is suitable for pattern recognition, regression, and learning a ranking function. This code was used for the text classification.
The C5.0 algorithm was tested in the SPSS Modeler environment, a complex set of tools supporting the whole data mining process, developed by IBM (2011). Its design is based on the CRISP-DM (Cross-Industry Standard Process for Data Mining) concept. Modeler is user friendly, has a graphical user interface, and allows realizing all analyses by creating data streams.
Results
The experiments were primarily focused on finding the effect of stopword removal on the efficiency of the C5.0, SVM, and k-Means classification and clustering algorithms.
C5.0
The set of classification experiments was performed with input files with the following characteristics:
- minimum word length: 2 characters
- minimum frequency of words in all documents: 3
- no stopword removal performed during preprocessing
- data representation: TF-IDF
Stopwords were removed directly in the program data stream, in counts of 200 and 1 000. The experiments were performed on the hotel review collections in all tested languages, each containing 10 000 documents, and on the Reuters (R8) collection.
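Since C5.0 runs inside SPSS Modeler and its data stream cannot be reproduced here, the following Python sketch only mimics the experiment design under assumptions: build a classifier with and without a generated stoplist and compare cross-validated accuracy and running time. The scikit-learn DecisionTreeClassifier, the tiny sample, and the stoplist are stand-ins, not the actual C5.0 setup.

import time
# Hedged illustration of the experiment design: compare accuracy and
# model-building time with and without a generated stoplist. A scikit-learn
# decision tree stands in for C5.0, which runs inside SPSS Modeler.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

texts = ["clean room friendly staff", "nice quiet location great value",
         "dirty room rude staff", "cold breakfast noisy street"] * 25
labels = ["pos", "pos", "neg", "neg"] * 25
stoplist = ["room", "staff"]                    # hypothetical generated stoplist

for stop in (None, stoplist):
    X = TfidfVectorizer(stop_words=stop).fit_transform(texts)
    start = time.time()
    scores = cross_val_score(DecisionTreeClassifier(), X, labels, cv=5)
    variant = "with stoplist" if stop else "without stoplist"
    print(f"{variant}: accuracy={scores.mean():.3f}, time={time.time() - start:.3f}s")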
Figure 1 shows a minimal effect of the word removal on classifier accuracy. The influence of extracting 200 words is insignificant; extracting 1 000 words has a slightly better influence on accuracy, but the change is still only in tenths of a percent. More interesting is Figure 2, illustrating the time savings in model creation. Eliminating 1 000, and even 200, stopwords resulted in significant time savings, while the accuracy remained almost the same.
SVM
The input file characteristics:
- minimum word length: 2 characters
- minimum frequency of words in all documents: 3
- stopword removal performed
- data normalization: cosine
- data representation: TF-IDF
The parameterization of the classifier was chosen with regard to textual document processing and is based on the results published by Joachims (1998); see the invocation sketch after this list:
- kernel function: polynomial
- degree of the polynomial: 3
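A hedged sketch of what the corresponding SVM light invocation can look like follows; the -t and -d switches select the polynomial kernel and its degree, the file names are hypothetical, and the binaries are assumed to be on the PATH.

# Hedged sketch of running SVM light with a polynomial kernel of degree 3;
# train.dat, model.dat, and test.dat are hypothetical file names.
import subprocess

subprocess.run(["svm_learn", "-t", "1", "-d", "3",   # -t 1: polynomial kernel, -d 3: degree
                "train.dat", "model.dat"], check=True)
subprocess.run(["svm_classify", "test.dat", "model.dat", "predictions.txt"],
               check=True)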
The stopword extraction was done during the transformation of the data into the vector representation, always in counts of 200 and 1 000, and for the English hotel review collection also in a count of 3 000. The hotel review collections in all selected languages, containing 10 000 and 50 000 documents, and the Reuters collection were tested.
Figure 3 shows the influence of removing 200, 1 000, and 3 000 stopwords. The precision of the classifier did not change substantially. The model creation time is also independent of the stopword extraction.
K-Means
The input file characteristics:
- minimum word length: 2 characters
- minimum frequency of words in all documents: 3
- stopword removal performed
- data representation: TF-IDF
The parameterization of the algorithm was derived from the paper by Žižka et al. (2012); see the invocation sketch after this list:
- clustering method: direct (k-Means)
- criterion function: H2
- similarity measure: cosine
- number of clusters: 2 and 5 (hotel reviews); 8 and 15 (R8)
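A hedged sketch of the corresponding Cluto vcluster invocation is given below, assuming the option names documented by Karypis (2003); the matrix and label file names are hypothetical.

# Hedged sketch of running Cluto's vcluster with the parameters listed above;
# option names as documented by Karypis (2003), file names are hypothetical.
import subprocess

subprocess.run(["vcluster",
                "-clmethod=direct",             # direct k-way clustering (k-Means)
                "-crfun=h2",                    # H2 criterion function
                "-sim=cos",                     # cosine similarity
                "-rclassfile=reviews.rlabel",   # true classes for entropy/purity
                "reviews.sparse", "2"],         # matrix file, number of clusters
               check=True)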
The extraction of stopwords was done during the transformation of the data into the vector representation, always in counts of 200 and 1 000, and for the hotel reviews written in English also in a count of 3 000. The tests were performed on the hotel review collections in all selected languages, containing 10 000 and 50 000 documents, and on the Reuters (R8) collection. Entropy and purity, frequently used external measures for evaluating document clusters, were used for cluster validation.
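The two measures can be summarized as follows: for each cluster, entropy measures how the true classes are mixed inside it (lower is better) and purity is the fraction of its documents belonging to the majority class (higher is better); both are weighted by cluster sizes. A small Python sketch of this computation, on hypothetical labels, is shown below.

import math
from collections import Counter

# Weighted entropy (lower is better) and purity (higher is better) of a
# clustering, evaluated against known document classes.
def entropy_purity(clusters, classes):
    """clusters and classes are parallel lists of labels, one per document."""
    n = len(classes)
    q = len(set(classes))                        # number of true classes
    total_entropy = total_purity = 0.0
    for cluster in set(clusters):
        members = [cls for clu, cls in zip(clusters, classes) if clu == cluster]
        probs = [cnt / len(members) for cnt in Counter(members).values()]
        ent = -sum(p * math.log(p) for p in probs) / math.log(q) if q > 1 else 0.0
        total_entropy += len(members) / n * ent
        total_purity += len(members) / n * max(probs)
    return total_entropy, total_purity

# Hypothetical example: two clusters over six documents with known classes.
print(entropy_purity([0, 0, 0, 1, 1, 1],
                     ["pos", "pos", "neg", "neg", "neg", "pos"]))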
The results for the Spanish collection with 2 clusters are shown in Figure 4. A positive effect, a decrease in entropy, was measured when removing 200 and 1 000 stopwords; only the FFR method is not useful. In this case the number of clusters is equal to the number of classes. A different situation is shown in Figure 5, where the number of clusters is higher than the number of predefined classes.
Discussion
It is not simple to derive a global conclusion about stopword filtering and to define a recommendation for its use. It depends primarily on the purpose and priorities of the specific task, for example the processing time, the classifier quality, or the algorithm used.
The quality of categorization achieved by the decision tree was barely influenced by the extraction of generated stopwords; the impact on classifier correctness was slightly positive, with a maximum improvement on the order of percentage points. The significant effect of the removal was the time saved in model creation; the time decrease was predictable and notable, in some cases up to 30%. All tested methods except ODDR are appropriate candidates. So, if building the decision tree model is expected to be very time consuming, using stopword extraction is warmly recommended. The removal is also recommended for tasks oriented towards achieving a high classification correctness, that is, tasks where every tenth of a percent of correctly categorized cases matters; especially suitable methods for these cases are CHI, IG, NGL, and MI. An increase of the classifier's success rate by tenths of a percent may be useless in many tasks, in which case the only added value is the time saved in creating the decision tree model.
Joachims (1998) points out that the SVM method is able to learn independently of the space dimensionality; nevertheless, the classification results indicate a small increase of correctness on the collections containing 10 thousand records and on R8. Similarly to C5.0, the maximum increase of correctness was in tenths of a percent. The improvement was not achieved by all methods; the best of the tested methods were GSS, CHI, IG, and NGL, which showed it most often. The extraction performed on the collections containing 50 thousand documents had a slightly negative impact on the classifier's quality. No time savings were measured. Stopword removal in this case has minimal added value.
The measured quality of clustering after stopword filtering increased for a large part of the collections; empirically, up to a 15% decrease in entropy and a 10% increase in purity were observed. These results were achieved to the greatest extent by applying the CHI, GSS, and IG methods. In the cases where no quality improvement was achieved, the results did not radically degrade either. Due to these findings and the relatively high chance of getting better clustering results (70-75% when using the mentioned methods), stopword extraction can be recommended.
Conclusion
The research was oriented towards the analysis of the influence of text data preprocessing on the results of classification and clustering, with a focus on the extraction of stopwords from textual documents. In total, more than 500 experiments were performed; from the obtained results it was possible to derive the recommendations mentioned in the discussion. The outputs of this research may be useful for other works and further research.
References
CARDOSO-CACHOPO, A., 2007: Improving Methods for Single-label Text Categorization. PhD Thesis. Lisbon: UTL.
DAVE, K., 2011: Study of feature selection algorithms for text-categorization. Las Vegas:
University of Nevada.
FELDMAN, R., SANGER, J., 2007: The text mining handbook: Advanced Approaches in
Analyzing Unstructured Data. New York: Cambridge. 423 p. ISBN 978-0-521-83657-9.
IBM Corporation, 2011: IBM SPSS Modeler 14.2 Modeling Nodes. 457 p. Available from
ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/14.2/en/ModelingN
odes.pdf.
JOACHIMS, T., 1998: Text Categorization with Support Vector Machines: Learning with
Many Relevant Features. Berlin: Springer.
JOACHIMS, T., 1999: Making large-Scale SVM Learning Practical. In Advances in Kernel
Methods – Support Vector Learning, B. Schölkopf and C. Burges and A. Smola (ed.).
Cambridge: MIT-Press.
KARYPIS, G., 2003: Cluto: A Clustering Toolkit. Minnesota: UMN.
KRUPNÍK, J., 2013: Automatizace generování stopslov. [CD-ROM]. In: PEFnet 2013. ISBN
978-80-7375-669-7.
LI, S., et al., 2009: A framework of feature selection methods for text categorization. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2. Association for Computational Linguistics. pp. 692-700.
MAKREHCHI, M., KAMEL, M. S., 2008: Automatic extraction of domain-specific
stopwords from labeled documents. In: Advances in information retrieval. Springer Berlin
Heidelberg, pp. 222-233.
RACHEL, TW. L., HE, B., OUNIS, I., 2005: Automatically Building a Stopword List for an
Information Retrieval System. In: Journal on Digital Information Management: Special Issue
on the 5th Dutch-Belgian Information Retrieval Workshop (DIR), pp. 17–24.
SINKA, P. M., CORNE, W. D., 2003: Evolving Better Stoplists for Document Clustering and
Web Intelligence. In: HIS. pp. 1015–1023.
UCHYIGIT, G., CLARK, K., 2008: Personalization techniques and recommender systems.
Singapore: World Scientific. ISBN 978-981-2797-025.
WEISS, S. M., et al., 2010: Fundamentals of predictive text mining. New York: Springer-Verlag, xiii, 226 p. Texts in computer science. ISBN 9781849962261.
YANG, Y., PEDERSEN, J. O., 1997: A comparative study on feature selection in text
categorization. In: ICML. pp. 412-420.
ŽIŽKA, J., BURDA, K., DAŘENA, F., 2012: Clustering a Very Large Number of Textual
Unstructured Customers’ Reviews in English. In: Artificial Intelligence: Methodology,
Systems, and Applications. Berlin: Springer.
ŽIŽKA, J., DAŘENA, F., 2010: Automatic Sentiment Analysis Using the Textual Pattern
Content Similarity in Natural Language. Lecture Notes in Artificial Intelligence, 6231, 1:
224–231. ISSN 0302-9743.
ŽIŽKA, J., DAŘENA, F., 2011: Mining Significant Words from Customer Opinions Written
in Different Natural Languages. In: Text, Speech and Dialogue. Springer Berlin Heidelberg,
2011. pp. 211–218.