The Influence of Stopwords Removal on Text Mining Task Results

Ing. Jiří Krupník, Department of Informatics, Faculty of Business and Economics, Mendel University, jiri.krupnik@mendelu.cz

Abstract
This paper deals with the analysis of the influence of preprocessing textual documents on the results of text mining tasks. One of the possible ways of preprocessing textual documents is stopword removal. Results achieved in experiments with large collections of real-world documents written in natural languages are presented and discussed. The research was focused on the influence on the performance and results of different kinds of clustering and classification algorithms.

Key Words
Stopword, classification, clustering, text mining, text documents, collections, hotel reviews, Reuters

Introduction
Text mining is a relatively new area of computer science research that tries to solve the crisis of information overload by combining techniques from data mining, machine learning, natural language processing, information retrieval, and knowledge management. The main tasks solved in the context of text mining include text categorization, prediction, clustering, information extraction, and sentiment analysis (Weiss et al., 2010; Feldman and Sanger, 2007). In order to process data in this way, it is necessary to transform them into a representation appropriate for the chosen algorithm, and some preprocessing operations may be applied. Documents written in natural language are characterized by a large number of attributes, i.e. the number of unique words (terms). It is obvious that some of them carry no useful information for a mining algorithm. These irrelevant words are so-called noise words, and a collection of such words is a stoplist. Keeping stoplist words in the processing usually decreases the precision and efficiency of subsequent tasks. Reducing the noisy features helps clustering and categorization to be more effective (Sinka and Corne, 2003). These results are supported by many papers, e.g. Yang and Pedersen (1997), Rachel et al. (2005), Makrehchi and Kamel (2008), Li et al. (2009), and Dave (2011).

Goal
The research goal was to contribute to the development of analyzing and mining knowledge from textual data written in natural language by performing a set of experiments and evaluating the obtained results. For this purpose, it is necessary to analyze the current approaches to generating stopwords and to perform a set of classification and clustering experiments. The obtained results must be evaluated and discussed with a focus on comparing the individual stopword removal methods. To achieve the goal of this work, it is also necessary to introduce the text mining issues and the approaches to preprocessing textual documents, including the software systems used to execute these tasks and experiments.
Methods
The stopword extraction is based on automatic generation methods defined in publications by the following authors: Yang and Pedersen (1997) and Uchyigit and Clark (2008). These methods are designed for labeled data, and the following were used in the experiments:
- Odds Ratio (ODDR)
- Information Gain (IG)
- F-measure Feature Ranking (FFR)
- Chi-square Statistic (CHI)
- Mutual Information (MI)
- Ng-Goh-Low Coefficient (NGL)
- Galavotti-Sebastiani-Simi Coefficient (GSS)
The fundamental text mining tasks addressed in the research are text classification and clustering. These tasks were tested with modern machine learning algorithms that form the basis of the data mining area. The purpose of classification algorithms is to assign the correct properties (classes) to an instance (sample) on the basis of a set of preclassified training documents. It is supervised machine learning (Joachims, 1998). The classification methods chosen for the experiments are C5.0 and SVM. Clustering is a process through which objects are classified into groups called clusters. In the case of clustering, the problem is to group the given unlabeled collection into meaningful clusters without any prior information. This concept is called unsupervised machine learning (Karypis, 2003). The k-means method was chosen for testing.

Objectives and methodology
The purpose of the research is to find the role of stopwords in categorization and clustering tasks, more precisely, to discover the influence of removing stopwords from text documents on the results of text mining tasks. The data background for the outputs and conclusions is the result of experiments executed with large textual documents. In general, to achieve the goal, it is necessary to perform the following steps:
- analysis of current approaches to creating a stoplist
- own implementation of the chosen methods for generating stopwords
- selection of textual data and their conversion to an appropriate format
- choice of software allowing text classification and clustering to be realized
- design and execution of the sets of experiments
- evaluation of the effect of stopwords removal on the task results
The first two steps are described in detail by the author in his previous paper (Krupník, 2013).

Source data
The text data used in the experiments contained the opinions, written in many languages, of several million customers who booked accommodation in many different hotels and countries via an on-line Internet service. The review texts have two parts, a negative and a positive experience with the hotel, both written in a natural language (Žižka and Dařena, 2011). Three representatives of widely used languages were selected for the experiments (English, German, and Spanish), complemented by a fourth representative, Czech. The second text source used in the experiments is the standard Reuters-21578 collection, more precisely its subset R8 created by Cardoso-Cachopo (2007). The documents contain newswire articles released by Reuters, in this case divided into eight classes.

Used software
First of all, the source texts must be transformed into an internal representation, i.e. into an appropriate format. The TextMining program module (Žižka and Dařena, 2010) was used for this purpose. The formats for storing the output text files were chosen with regard to the requirements of the software used: csv, sparse and rlabel, dat. The author's own script implementing the automatic generation of stopwords was used for creating the stoplists.
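The following is a minimal sketch of such a stoplist-generating script, not the actual implementation used in the research: it ranks the terms of a labeled collection with the chi-square statistic, one of the measures listed in Methods, and returns the least class-discriminative terms as stopword candidates. The toy corpus, the function name chi_square_stoplist, and the parameter top_n are illustrative assumptions.

from collections import Counter

def chi_square_stoplist(docs, labels, top_n=200):
    """Rank terms by their maximum chi-square score over all classes; the terms
    with the LOWEST scores are the least class-discriminative and are therefore
    returned as stopword candidates (a sketch, not the paper's own script)."""
    classes = set(labels)
    n_docs = len(docs)
    doc_terms = [set(d.lower().split()) for d in docs]        # document frequencies only
    df = Counter(t for terms in doc_terms for t in terms)     # documents containing a term
    df_in_class = {c: Counter() for c in classes}
    class_size = Counter(labels)
    for terms, label in zip(doc_terms, labels):
        df_in_class[label].update(terms)

    scores = {}
    for term, n_t in df.items():
        best = 0.0
        for c in classes:
            a = df_in_class[c][term]          # documents of class c containing the term
            b = n_t - a                       # documents of other classes containing the term
            c_ = class_size[c] - a            # documents of class c without the term
            d = n_docs - n_t - c_             # documents of other classes without the term
            denom = (a + c_) * (b + d) * (a + b) * (c_ + d)
            chi2 = n_docs * (a * d - b * c_) ** 2 / denom if denom else 0.0
            best = max(best, chi2)
        scores[term] = best

    # terms whose occurrence is (nearly) independent of the class labels come first
    return sorted(scores, key=scores.get)[:top_n]

# Example usage with a toy labeled collection of hotel reviews:
docs = ["the room was clean and quiet", "the staff at the desk was rude", "the breakfast was great"]
labels = ["positive", "negative", "positive"]
print(chi_square_stoplist(docs, labels, top_n=5))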
Cluto (Clustering Toolkit) is a software package consisting of two individual programs (vcluster and scluster) intended for clustering datasets and for analyzing the characteristics of the various clusters; it was developed by Karypis (2003). The k-means method was tested with this program.
SVMlight is an implementation of the Support Vector Machine algorithm created by Joachims (1999); it is appropriate for pattern recognition, regression, and learning a ranking function. The code was used for text classification.
The C5.0 algorithm was tested in the SPSS Modeler environment, a complex set of tools supporting the whole data mining process, developed by IBM (2011). Its design is based on the CRISP-DM (Cross-Industry Standard Process for Data Mining) concept. Modeler is user friendly, has a graphical user interface, and allows all analyses to be realized by creating a data stream.

Results
The experiments were primarily focused on finding the effect of stopword removal on the efficiency of the classification and clustering algorithms C5.0, SVM, and k-means.

C5.0
The set of classification experiments was performed with input files with the following characteristics:
- minimum word length: 2 characters
- minimum frequency of words in all documents: 3
- stopword removal: not performed during preprocessing
- data representation: TF-IDF
Stopwords were removed directly in the program's data stream, in counts of 200 and 1 000. The experiments were performed on the hotel review collections in all tested languages, each containing 10 000 documents, and on the Reuters (R8) collection. Figure 1 shows a minimal effect of word removal on classifier accuracy. The influence of extracting 200 words is insignificant; extracting 1 000 words has a slightly better influence on accuracy, but the change is still only in tenths of a percent. More interesting is Figure 2, illustrating the time saved in model creation. Eliminating 1 000 stopwords, and even 200 words, resulted in significant time savings, while the accuracy remained almost the same.

SVM
The input file characteristics:
- minimum word length: 2 characters
- minimum frequency of words in all documents: 3
- stopword removal: performed
- data normalization: cosine
- data representation: TF-IDF
The parameterization of the classifier was chosen with regard to textual document processing and is based on the results published by Joachims (1998):
- kernel function: polynomial
- degree of the polynomial: 3
Stopword extraction was done during the transformation of the data into the vector representation, in counts of 200 and 1 000, and for the English hotel review collection also in a count of 3 000. The hotel review collections in all selected languages, containing 10 000 and 50 000 documents, and the Reuters collection were tested. Figure 3 shows the influence of removing 200, 1 000, and 3 000 stopwords. The precision of the classifier did not change significantly. The model creation time is also independent of the stopword extraction.

K-Means
The input file characteristics:
- minimum word length: 2 characters
- minimum frequency of words in all documents: 3
- stopword removal: performed
- data representation: TF-IDF
The parameterization of the algorithm was derived from the paper by Žižka et al. (2012):
- clustering method: direct (k-means)
- criterion function: H2
- similarity measure: cosine
- number of clusters: 2 and 5 (hotel reviews); 8 and 15 (R8)
The extraction of stopwords was done during the transformation of the data into the vector representation, in counts of 200 and 1 000, and for the hotel reviews written in English also in a count of 3 000.
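The sketch below illustrates one such clustering experiment and its evaluation using the scikit-learn library rather than the Cluto toolkit actually used in the research; on length-normalized TF-IDF vectors, Euclidean k-means approximates the cosine-based clustering described above. Entropy and purity, the external validation measures discussed below, are computed explicitly. The function cluster_and_evaluate, the stoplist argument, and the commented usage lines are illustrative assumptions.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

def cluster_and_evaluate(docs, labels, stoplist=None, n_clusters=2):
    # TF-IDF representation; min_df=3 mirrors "minimum frequency of words: 3",
    # and the token pattern keeps only words of at least 2 characters
    vec = TfidfVectorizer(min_df=3, token_pattern=r"(?u)\b\w\w+\b", stop_words=stoplist)
    X = normalize(vec.fit_transform(docs))   # unit-length vectors approximate cosine similarity

    clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)

    # External validation: entropy (lower is better) and purity (higher is better)
    classes = sorted(set(labels))
    n = len(labels)
    entropy, purity = 0.0, 0.0
    for k in range(n_clusters):
        members = [labels[i] for i in range(n) if clusters[i] == k]
        if not members:
            continue
        counts = np.array([members.count(c) for c in classes], dtype=float)
        p = counts / counts.sum()
        p = p[p > 0]
        entropy += (len(members) / n) * (-(p * np.log2(p)).sum() / np.log2(len(classes)))
        purity += counts.max() / n
    return entropy, purity

# e.g. compare the same collection clustered with and without a generated stoplist:
# entropy_no, purity_no = cluster_and_evaluate(reviews, review_classes, stoplist=None)
# entropy_chi, purity_chi = cluster_and_evaluate(reviews, review_classes, stoplist=chi_stoplist)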
The tests were performed on the hotel review collections in all selected languages, containing 10 000 and 50 000 documents, and on the Reuters (R8) collection. Entropy and purity, frequently used external measures for evaluating document clustering, were used for cluster validation. The results for the Spanish collection with 2 clusters are shown in Figure 4. A positive effect, a decrease of entropy, was measured when 200 and 1 000 stopwords were removed; only the FFR method is not useful. In this case the number of clusters is equal to the number of classes. A different situation is shown in Figure 5, where the number of clusters is higher than the number of predefined classes.

Discussion
It is not simple to derive a global conclusion about stopword filtering and to define a recommendation for its usage. Primarily it depends on the purpose and preferences of the specific task, for example the processing time, the classifier quality, or the algorithm used.
The quality of categorization achieved by the decision tree was hardly influenced by the extraction of the generated stopwords. The impact on classifier correctness was slightly positive, with the maximum improvement in the order of percentage units. A significant effect of the removal was the time saved in model creation; the decrease was predictable and notable, in some cases up to 30%. All the tested methods are appropriate candidates, except ODDR. So, if a high time complexity of creating the decision tree model is expected, the usage of stopword extraction is strongly recommended. Removal is also recommended for tasks oriented towards achieving a high classification correctness, i.e. tasks where every tenth of a percent of correctly categorized cases matters; suitable methods for these cases are especially CHI, IG, NGL, or MI. An increase in classifier accuracy of tenths of a percent may be useless in many tasks, and then the only added value is the time saved in creating the decision tree model.
Joachims (1998) points out that the SVM method has the ability to learn independently of the dimensionality of the feature space; nevertheless, the classification results indicate a small increase of correctness for the collections containing 10 thousand records and for R8. Similarly to C5.0, the maximum increase of correctness was in tenths of a percent. The improvement was not achieved with all methods; the best of the tested methods were GSS, CHI, IG, and NGL, which showed it most often. The extraction performed on the collections containing 50 thousand documents had a slightly negative impact on the classifier's quality. No time saving was measured. Stopword removal has minimal added value in this case.
The measured quality of clustering after stopword filtering increased for a large part of the collections; empirically, up to a 15% decrease of entropy and a 10% increase of purity were observed. These results were achieved to a greater extent by applying the CHI, GSS, and IG methods. In cases where the improvement in quality was not achieved, it did not lead to a radical degradation of the results. Due to these findings and the potentially high opportunity to get better clustering results (the chance is 70–75% when using the mentioned methods), it is possible to recommend stopword extraction.

Conclusion
The research was oriented towards analyzing the influence of preprocessing text data on the results of classification and clustering, with a focus on stopword extraction from textual documents. In total, more than 500 experiments were performed; from the obtained results it is possible to derive recommendations, which are given in the discussion.
The outputs of this research may be useful for other works and for further research.

References
CARDOSO-CACHOPO, A., 2007: Improving Methods for Single-label Text Categorization. PhD Thesis. Lisbon: UTL.
DAVE, K., 2011: Study of feature selection algorithms for text-categorization. Las Vegas: University of Nevada.
FELDMAN, R., SANGER, J., 2007: The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. New York: Cambridge University Press. 423 p. ISBN 978-0-521-83657-9.
IBM Corporation, 2011: IBM SPSS Modeler 14.2 Modeling Nodes. 457 p. Available from ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/14.2/en/ModelingNodes.pdf.
JOACHIMS, T., 1998: Text Categorization with Support Vector Machines: Learning with Many Relevant Features. Berlin: Springer.
JOACHIMS, T., 1999: Making Large-Scale SVM Learning Practical. In: Advances in Kernel Methods – Support Vector Learning, B. Schölkopf, C. Burges and A. Smola (eds.). Cambridge: MIT Press.
KARYPIS, G., 2003: Cluto: A Clustering Toolkit. Minnesota: UMN.
KRUPNÍK, J., 2013: Automatizace generování stopslov. [CD-ROM]. In: PEFnet 2013. ISBN 978-80-7375-669-7.
LI, S., et al., 2009: A framework of feature selection methods for text categorization. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2. Association for Computational Linguistics. pp. 692–700.
MAKREHCHI, M., KAMEL, M. S., 2008: Automatic extraction of domain-specific stopwords from labeled documents. In: Advances in Information Retrieval. Berlin Heidelberg: Springer. pp. 222–233.
RACHEL, T. W. L., HE, B., OUNIS, I., 2005: Automatically Building a Stopword List for an Information Retrieval System. In: Journal on Digital Information Management: Special Issue on the 5th Dutch-Belgian Information Retrieval Workshop (DIR). pp. 17–24.
SINKA, M. P., CORNE, D. W., 2003: Evolving Better Stoplists for Document Clustering and Web Intelligence. In: HIS. pp. 1015–1023.
UCHYIGIT, G., CLARK, K., 2008: Personalization Techniques and Recommender Systems. Singapore: World Scientific. ISBN 978-981-2797-025.
WEISS, S. M., et al., 2010: Fundamentals of Predictive Text Mining. New York: Springer-Verlag. 226 p. Texts in Computer Science. ISBN 978-1-84996-226-1.
YANG, Y., PEDERSEN, J. O., 1997: A comparative study on feature selection in text categorization. In: ICML. pp. 412–420.
ŽIŽKA, J., BURDA, K., DAŘENA, F., 2012: Clustering a Very Large Number of Textual Unstructured Customers' Reviews in English. In: Artificial Intelligence: Methodology, Systems, and Applications. Berlin: Springer.
ŽIŽKA, J., DAŘENA, F., 2010: Automatic Sentiment Analysis Using the Textual Pattern Content Similarity in Natural Language. Lecture Notes in Artificial Intelligence, 6231, 1: 224–231. ISSN 0302-9743.
ŽIŽKA, J., DAŘENA, F., 2011: Mining Significant Words from Customer Opinions Written in Different Natural Languages. In: Text, Speech and Dialogue. Berlin Heidelberg: Springer. pp. 211–218.