Enhancing Text Analysis via Dimensionality Reduction

David G. Underhill and Luke K. McDowell
Dept. Computer Science, U.S. Naval Academy
572M Holloway Rd, Annapolis, MD 21402
(410) 293-6800
david.g.underhill@gmail.com, lmcdowel@usna.edu

David J. Marchette and Jeffrey L. Solka
Code Q20, NSWCDD
Dahlgren, VA 22448
(540) 653-1910
{david.marchette,jeffrey.solka}@navy.mil

Abstract

Many applications require analyzing vast amounts of textual data, but the size and inherent noise of such data can make processing very challenging. One approach to these issues is to mathematically reduce the data so as to represent each document using only a few dimensions. Techniques for performing such "dimensionality reduction" (DR) have been well studied for geometric and numerical data, but more rarely applied to text. In this paper, we examine the impact of five DR techniques on the accuracy of two supervised classifiers on three textual sources. This task mirrors important real-world problems, such as classifying web pages or scientific articles. In addition, the accuracy serves as a proxy measure for how well each DR technique preserves the inter-document relationships while vastly reducing the size of the data, facilitating more sophisticated analysis. We show that, for a fixed number of dimensions, DR can be very successful at improving accuracy compared to using the original words as features. Surprisingly, we also find that one of the simplest DR techniques, MDS, is among the most effective. This suggests that textual data may often lie upon a linear manifold where the more complex non-linear DR techniques do not have an advantage.

1 Introduction

Individuals, companies, and governments are surrounded by millions of web pages, communications, and other documents that are potentially relevant, yet in danger of being overlooked amongst all the other data. Analyzing such data is complicated by its representation as natural language text, rather than as a structured database record or a numerical measurement. While such documents can be analyzed with sophisticated natural language processing applications, the challenges of fully understanding text and the daunting amount of data often make statistical text mining techniques more attractive. Such techniques typically encode the content of a document as a vector with thousands of dimensions, one for each useful word in the corpus.

One possible approach to simplifying the analysis of such high-dimensional data is to apply some form of "dimensionality reduction" (DR). Techniques for DR all seek to take input points and represent them in a much smaller number of dimensions, while retaining important characteristics of the original data. These techniques have been primarily applied to numerical data, particularly data representing geometric shapes. With a few exceptions (see Section 5), they have not been as well studied for textual data.

This paper analyzes how traditional and more recent DR techniques might be profitably applied to textual data. In particular, we evaluate five such techniques: Principal Components Analysis (PCA) [8], Multi-dimensional Scaling (MDS) [11], Isomap [5], Locally Linear Embedding (LLE) [15], and Lafon's Diffusion Maps (LDM) [4]. To measure their effectiveness with text, we perform document classification using both linear and k-nearest-neighbor classifiers. Document classification is a fundamental and useful text analysis task (e.g., for categorizing newly discovered web pages based on previously labeled pages). In addition, our document classification results indicate how well each DR technique has preserved the interesting inter-document relationships. Thus, these results help to predict which DR techniques might be best suited for pre-processing before performing more complex analysis tasks such as Literature-based discovery [14, 17] or sentiment analysis [13].
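The pipeline evaluated in this paper (encode documents as TF-IDF vectors, reduce dimensions, classify) can be sketched with off-the-shelf components. Below is a minimal illustration using scikit-learn stand-ins on a tiny toy corpus; the paper's own Java implementations and corpora are not reproduced here, so everything in this sketch is an assumption for illustration, not the authors' code:

```python
# Sketch of the encode -> reduce -> classify pipeline studied in this
# paper, using scikit-learn stand-ins (assumption: the authors used
# their own Java implementations, not these library routines).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import Isomap
from sklearn.neighbors import KNeighborsClassifier

docs = [  # toy stand-in for a labeled news corpus
    "the telescope observed a distant galaxy and a new star",
    "astronomers found a star cluster near the galaxy core",
    "the vaccine trial showed the drug reduced patient symptoms",
    "doctors reported the new drug helped patients in the trial",
]
labels = ["astronomy", "astronomy", "medicine", "medicine"]

# 1. Encode: weighted term-document matrix (TF-IDF weighting).
tdm = TfidfVectorizer().fit_transform(docs).toarray()

# 2. Reduce: embed each document in a handful of dimensions.
reduced = Isomap(n_components=2, n_neighbors=2).fit_transform(tdm)

# 3. Classify: cosine-distance kNN with distance-weighted voting.
knn = KNeighborsClassifier(n_neighbors=2, metric="cosine",
                           weights="distance").fit(reduced, labels)
print(knn.predict(reduced))
```

Swapping `Isomap` for `TruncatedSVD`, `MDS`, or `LocallyLinearEmbedding` reproduces the spirit of the other reduction variants compared below, though the library algorithms differ in detail from the implementations described in Section 3.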
Our contributions are as follows. First, we demonstrate that, for a fixed number of dimensions, applying DR before classification can significantly improve accuracy. This enables classifiers to be more efficient and reduces the impact of noisy dimensions on the results. Second, we show the surprising result that some of the simplest techniques perform best. In particular, we find that MDS and the closely related Isomap have the most reliable performance, almost always yielding the best accuracy among the DR techniques. In addition, both MDS and Isomap are frequently able to match or approach the performance of a naive data representation that uses thousands of dimensions while using only 10-20 dimensions. Finally, using specially modified versions of one of our data sets, we demonstrate that the advantages of MDS and Isomap are even more pronounced when classification is more difficult. Overall, our results show that simple techniques can be highly effective at reducing textual data while maintaining its most interesting properties.

The following section explains the DR techniques that we evaluate. Section 3 outlines our experimental method and Section 4 presents our results. Section 5 compares these findings with related work, and Section 6 concludes.

2 Background

Dimensionality reduction can take many different forms. In feature selection, DR consists simply of eliminating less significant features from the data set, e.g., by information gain [21]. We focus instead on DR for feature extraction, where new features are created based on some (possibly complex) transformation of the input features. DR methods can also be divided into supervised and unsupervised techniques, where the supervised techniques are informed by labels associated with the input instances (or documents). Because we wish to facilitate text mining applications where such labels are often not known (e.g., with the aforementioned sentiment analysis), we focus on unsupervised DR techniques. Finally, a DR technique is linear if a linear transformation converts an input data point to the reduced feature space; otherwise, the technique is non-linear.

We consider the following unsupervised DR techniques:

1. Principal Components Analysis (Linear) - PCA is a correlation-based technique that chooses a set of representative dimensions called the principal components based on the degree of variation that they capture from the original data [8]. In particular, PCA computes the output points by performing a singular value decomposition (SVD) on the document covariance matrix and then multiplies the resulting eigenvectors with their corresponding eigenvalues.

2. Metric Multidimensional Scaling (Linear) - MDS focuses on preserving distances between pairs of points [20]. The input is a matrix containing pairwise distances between the original points. MDS performs an eigenvalue decomposition on this matrix in a way that embeds the points in a smaller space while maintaining the relative pairwise relationships [11].

3. Isomap (Non-Linear) - Like MDS, Isomap seeks to preserve the pairwise distances between input points [1]. Its matrix, however, is based on the geodesic distance, which is computed by connecting points in a computed nearest-neighbors graph. This process can more accurately represent some local patterns [5]. The final result is produced by running MDS on the modified distance matrix.

4. Lafon's Diffusion Maps (Non-Linear) - LDM performs non-linear transformations on an initial interpoint distance matrix in a way that helps accentuate local relationships [4]. In particular, LDM tries to more strongly connect those points that are connected by multiple paths in a nearest-neighbors graph. As with PCA, the final step is a singular value decomposition.

5. Locally Linear Embedding (Non-Linear) - LLE uses geometric intuition and an assumption that high-dimensional data actually resides on some low-dimensional manifold within the large input space [15]. A reduced embedding is found by translating, rotating, and scaling the existing data based on weights that maintain geometric properties present in a graph of nearest neighbors. The transformed data is then reduced via an eigenvalue decomposition.

3 Methodology

In our experiments, we begin with a collection of documents, encode the documents into a term-document matrix, then perform dimensionality reduction on this large matrix. Using the reduced matrix, we then classify each document into one of several known categories. Finally, we evaluate the results. Below we elaborate on each of these steps.

3.1 Data Sets

Table 1 shows the three data sets that we used for our experiments. The Science News [16] and Google News (drawn from articles on news.google.com) corpuses each contain over 1,000 articles distributed among a number of well-defined categories. S&T [16] is much more difficult to classify because there is much less information per article; some articles contain only 2-3 sentences. Furthermore, S&T has the smallest ratio of articles to categories.

Table 1. Data Sets

Name             # Docs  Avg. Doc. Size  Categories
Science News     1047    8000 chars      8
Science & Tech.  658     2500 chars      7
Google News      3028    4000 chars      5

3.2 Document Encoding

To facilitate further processing, our documents must first be encoded. Typically for text mining, a corpus is encoded as a matrix where each document is described by a row. There are a number of possible matrices and ways to compute them, but based on previous work [12, 16] we adopt the following straightforward scheme. Each column is a feature that corresponds to a single word found in the corpus. Cell (i, j) is then the TF-IDF [3] score for word j in document i, defined as:

T_ij = (t / T) · lg(D / d)

where t is the number of times word j appears in document i, T is the total number of words in document i, d is the total number of documents that contain word j, and D is the total number of documents in the corpus. The resulting matrix is known as a (weighted) Term-Document Matrix (TDM). In our implementation, before computing the TDM we perform word stemming and also discard any word that does not occur at least three times in the corpus.

3.3 Dimensionality Reduction

The encoded TDM has 600-3000 rows (one for each document) and 5000-11000 columns (one for each stemmed word in the corpus). We next use dimensionality reduction to significantly reduce this number of columns (or features). In our experiments, we use the five DR techniques described in Section 2. We implemented each technique in Java, then validated each against existing Matlab or R code that we obtained from others. Isomap and LLE require a choice of how many closest neighbors to consider when constructing the nearest-neighbor graph; experimentally we found that using 10 neighbors yielded good results. Each technique outputs the most significant dimensions first, enabling us to examine the impact of only using the first M dimensions for classification. In addition, we also compared against two variants that perform no DR. First, None-rand randomly selects and uses M features from the unmodified TDM matrix. Second, None-sort also uses M unmodified features, but selects the M features with the highest average TF-IDF score (the more significant words). These two variants highlight the potential classification performance easily obtainable without performing any DR.

3.4 Classifiers

We evaluated with three commonly-used classifiers:

1. k-Nearest Neighbor Classifier (kNN): kNN assigns a category to a document based on the class(es) of the k closest instances (nearest neighbors) in the training data as defined by a similarity function. Our implementation computed similarity using cosine distance and picked the most likely category using similarity-weighted voting among the closest neighbors. Experimentally, we found that different values of k produced similar trends; we report results for one setting that worked well for all DR techniques (k = 9).

2. Linear Classifier: Our second classifier assumes that each feature is normally distributed, and makes a classification decision based on a linear combination of the features. In particular, the likelihood of document i being in category k is computed as:

f(x_i, ĉ_k) = 1 / ((2π)^(N/2) · |Σ|^(1/2)) · e^(-(1/2)(x_i - µ_k) Σ^(-1) (x_i - µ_k)^T)

where N is the number of features, x_i is the row vector representing the document being classified, µ_k is the vector of feature means for category k from the training set, and Σ is the inter-feature covariance matrix. For the linear classifier, the same Σ is used for every category, which results in two categories being separated by a hyperplane in the feature space.

3. Quadratic Classifier: The quadratic classifier is like the linear classifier, but uses category-specific covariance matrices. Consequently, two categories are separated by more complex non-linear functions in the feature space. We found similar trends as with the linear classifier and omit details for lack of space.

3.5 Evaluation

We measure classification accuracy, which is the percentage of documents that were assigned to the correct category. All experiments used a leave-one-out procedure for testing.

4 Experimental Results

Table 2 shows the classification accuracy achieved (as a percentage) after encoding a corpus, performing DR, and then applying the kNN classifier (Table 3 gives results for "Linear"). Each row gives the results for one DR technique, or when using None-sort or None-rand (see Section 3.3). Within a corpus, the column labeled "All" shows the results obtained when the classifier is provided with all available features for each document. For None-sort or None-rand, this is every (stemmed) word (5000-11000 depending on the corpus).
For the DR techniques, "All" corresponds to using every dimension considered significant enough to be produced by the DR algorithm; this varies from 700 to 3300 dimensions depending on the technique and corpus. In addition, since each DR technique outputs the final dimensions sorted by their perceived significance, we also present results where the classifier uses only the first 5, 10, 20, 50, 100, or 500 dimensions. For instance, reducing Science News with PCA, then classifying using kNN with only the best 20 dimensions yields an accuracy of 81%. Results within 2% of the best for each column are shown in bold.

Table 2. Results with the kNN classifier

Science News
Num. dims.    5   10   20   50  100  500  All
PCA          77   81   81   80   76   47   39
MDS          72   78   78   78   77   78   78
Isomap       70   76   76   76   77   73   73
LLE          65   70   75   75   74   28   17
LDM          65   74   76   75   75   17   19
None-sort    34   41   54   66   67   75   78
None-rand    13   14   17   22   36   43   78

Science & Technology
Num. dims.    5   10   20   50  100  500  All
PCA          33   45   48   52   48   29   11
MDS          56   59   60   61   59   64   64
Isomap       57   59   61   60   61   58   58
LLE          26   26   25   24   24   23   22
LDM          23   26   24   28   34   13   06
None-sort    27   34   41   47   53   59   61
None-rand    12   14   15   24   26   39   61

Google News
Num. dims.    5   10   20   50  100  500  All
PCA          24   25   26   29   32   34   21
MDS          82   86   88   89   89   88   88
Isomap       83   86   86   87   87   85   84
LLE          78   80   83   83   84   82   72
LDM          22   23   26   57   60   55   51
None-sort    30   35   50   68   73   83   89
None-rand    18   19   23   27   38   57   89

Table 3. Results with the Linear classifier

Science News
Num. dims.    5   10   20   50  100  500  All
PCA          77   82   84   87   93   99   99
MDS          63   74   77   84   87   98   99
Isomap       66   71   77   81   83   99   99
LLE          49   60   70   76   81   96   99
LDM          59   71   76   82   86   98   99
None-sort    29   39   48   51   69   98   17
None-rand    08   19   16   34   43   91   17

Science & Technology
Num. dims.    5   10   20   50  100  500  All
PCA          31   37   45   58   65   95   95
MDS          50   58   64   72   79   99   99
Isomap       51   57   60   67   76   95   95
LLE          17   18   31   36   41   89   89
LDM          22   21   25   34   53   90   90
None-sort    23   30   43   54   68   99   07
None-rand    09   12   15   34   43   72   07

Google News
Num. dims.    5   10   20   50  100  500  All
PCA          20   22   23   31   34   38   61
MDS          78   79   82   85   89   95   99
Isomap       78   80   80   82   85   93   98
LLE          54   58   64   70   75   87   97
LDM          25   26   27   47   53   62   77
None-sort    38   46   54   68   76   91   39
None-rand    28   25   27   39   47   72   39

Before presenting our primary results, we first explain a few trends. First, performance generally increases as the number of dimensions increases. When the number of dimensions is very large (500 or "All"), however, performance may decrease, particularly with kNN. We note that kNN performance in this region would likely improve with a weighted similarity function, though the general trends among DR techniques would be the same. Linear is less susceptible to this problem, because it naturally detects when a particular feature is not helpful in discriminating among the classes. Thus, even when using all dimensions output by a particular DR technique (at most about 1000), performance is maintained or improved. However, even Linear can get overwhelmed by many noisy dimensions [18], as occurs with None-sort and None-rand, where "All" corresponds to 5000-11000 dimensions.

Result 1. Compared to other DR techniques, MDS and Isomap yield the most consistent and reliable classification performance. On S&T and Google News, MDS and Isomap dominate. For instance, with 100 dimensions MDS and Isomap improve on other DR techniques by between 11-37% for kNN and 11-38% for Linear. For these data sets, MDS is always within 2% of the best results, and Isomap almost always does just as well.

Science News is somewhat of an exception. For this corpus PCA yielded the best performance, for both kNN and Linear, especially when the number of dimensions is small.
For instance, with just 5 dimensions, PCA achieves 77% accuracy with Linear, while the next best is 66% with Isomap. However, MDS and Isomap improve more rapidly as the number of dimensions increases. Hence, by 50 dimensions MDS is within 3% of PCA, and Isomap is not far behind. Furthermore, with kNN, MDS and Isomap do not exhibit large drops in performance when using "All" dimensions, unlike PCA, LLE, and LDM.

In general, performance with kNN on DR-processed data starts to drop after 100 dimensions, because the newer (less significant) dimensions have more noise. Since our classifier uses an unweighted similarity function, these dimensions have as much impact as the first dimensions, and the noise causes accuracy to decrease. The MDS algorithm, however, naturally recognizes when there is little benefit to be added from additional dimensions, and as a result those dimensions take on very small values that have little impact on the similarity calculations. Isomap utilizes MDS as part of its processing and thus inherits this benefit. The net effect is that MDS and Isomap are very reliable performers across the whole range of reduced dimensions.

In contrast, PCA and LDM, the two SVD-based techniques, both perform very poorly on S&T and Google News when the number of dimensions is small. These techniques appear to be more sensitive to the smaller amounts of data per instance in these data sets (see Table 1). Despite this common property, the two data sets yield significantly different performance. On Google News, peak performance for both kNN and Linear reaches over 80% with just 10 dimensions. This can be attributed to the data set having just five well-separated categories evenly split among over 3,000 articles. Science and Technology, on the other hand, has seven categories and is unevenly distributed over just 658 articles. As a result, S&T needs 500 dimensions to get over 80% accuracy with Linear and peaks at 64% for kNN.

Result 2. Compared to not performing DR, applying DR before classification can greatly improve the accuracy obtainable with a given number of dimensions. Using MDS or Isomap almost always improved accuracy compared to None-sort or None-rand with the same number of dimensions. For instance, with Linear on Science News with 100 dimensions, using the raw words (None-rand) yielded an accuracy of just 43%. Choosing 100 good words with None-sort improved accuracy, but only to 69%. MDS and Isomap, however, reached 83-87%. For PCA, LLE, and LDM, this improvement held true for Science News, but usually not for the more challenging S&T or Google News.

Result 3. The best DR techniques can enable very strong accuracy using a small number of dimensions. With kNN, MDS with just 10-20 dimensions achieves nearly the same accuracy obtained by any DR or no-DR variant with any number of dimensions. For instance, MDS with 20 dimensions on Google News has an accuracy of 88%, compared to the best accuracy of 89% found with kNN. Isomap behaves similarly. Linear requires many more dimensions (500 or more) to reach its peak performance, though that level is always higher than with kNN. However, MDS and Isomap can still achieve very solid performance with fewer dimensions. For instance, with 100 dimensions MDS reaches 87% for Science News, 79% for S&T, and 89% for Google News.

Result 4. MDS and Isomap's advantages may be even more pronounced on more difficult data sets. To further investigate the situations where MDS and Isomap perform best with regard to classification, we created three variants of the Science News data set. SN2 is a subset of Science News that contains only Medicine and Astronomy articles, two well-separated categories. SN4-Sep contains four fairly well-separated categories: Astronomy, Medicine, Earth Sciences, and Life Sciences. Finally, SN4-OL contains four categories with collectively more topic overlap: Medicine, Life Sciences, Behavior, and Anthropology.
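The reliability of MDS observed above follows from how classical (Torgerson) metric MDS works: it double-centers the squared distance matrix and keeps the leading eigenvectors, so uninformative extra dimensions receive near-zero eigenvalues and contribute little to downstream similarity computations. Isomap inherits this behavior by running MDS as its final step. A minimal NumPy sketch of the classical procedure, under standard definitions rather than the authors' Java implementation:

```python
import numpy as np

def classical_mds(D, m):
    """Embed points in m dimensions from an n x n matrix D of
    pairwise Euclidean distances (classical/Torgerson metric MDS)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)           # eigenvalues in ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]   # re-sort descending
    keep = np.clip(vals[:m], 0, None)        # drop tiny negative noise
    return vecs[:, :m] * np.sqrt(keep)       # coordinates, best dims first

# Points that truly lie in 2-D: the 3rd MDS dimension gets a ~zero
# eigenvalue, so asking for extra dimensions barely affects distances.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [1.0, 2.0]])
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
X = classical_mds(D, 3)
print(np.round(np.abs(X[:, 2]), 6))  # third dimension is ~0
```

Because the embedding is exact for Euclidean input, pairwise distances among the rows of X reproduce D; in the paper's setting the input would instead be interpoint distances between TF-IDF document vectors.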
In general, we found results consistent with those above. Table 4 summarizes the results obtained when using 50 dimensions. SN2 is a much easier task, and almost every technique gets near-perfect accuracy. SN4-Sep is somewhat harder, but still yields better results than the original Science News, and every DR technique yields an accuracy above 80%. With the more difficult SN4-OL, however, PCA drops from 86-89% to just 31-43%. LDM suffers similar losses. MDS and Isomap, however, decline by only 1-4%. So while every DR technique did well on the easier SN2 and SN4-Sep corpuses, only MDS and Isomap maintained accuracy over 80% on SN4-OL (though LLE was close). Thus, when combined with our results from S&T and Google News, we observe that MDS and Isomap perform at or near the best observed levels, but particularly dominate when the analysis is complicated by shorter documents or less well-separated categories.

Table 4. Sci. News Variations (# dims. = 50)

        Corpus    PCA  MDS  Isomap  LLE  LDM  None-sort  None-rand
kNN     SN2        98   99      99   99   98         99         85
        SN4-Sep    86   85      82   84   82         81         68
        SN4-OL     43   81      83   78   44         69         44
Linear  SN2        99   99      99   99   99         98         83
        SN4-Sep    89   89      88   87   89         87         74
        SN4-OL     31   89      86   84   50         76         54

5 Related Work and Discussion

As previously discussed, DR has frequently been applied to data sources such as images [1, 5, 15] and biological data [7]. However, the uses of DR for text, particularly the sorts of unsupervised DR methods that we explore in this paper, have not been nearly as extensive. For instance, van der Maaten et al. [19] present a thorough analysis of classification performance using twelve DR techniques on seven data sets, but only one data set is textual. In addition, they do not examine the impact of different numbers of dimensions for the text corpus, do not consider perturbing the corpus, and do not evaluate MDS. Nonetheless, it is useful to make some limited comparisons between our results and theirs.
We both find that more complex techniques such as LLE and LDM do not usually improve over simpler techniques like PCA. However, PCA performs very well for their data sets, particularly on the one text corpus (where it is the best unsupervised technique). Our results suggest that PCA may indeed perform very well on some textual data (e.g., Science News), but poorly on more challenging data sets such as S&T.

Kim et al. [10] describe alternative DR techniques that can reduce the number of features needed to perform text classification. These techniques, however, require that the words in the data set already be clustered together into logical groups. Karypis and Han [9] explored how to expand concept indexing, an alternative DR technique, to exploit known classes of documents. For this one supervised technique, they demonstrate a small improvement in classification accuracy when using the reduced feature space. Bingham and Mannila [2] compare a PCA-like DR technique against a computationally efficient approach based on random projection, and find that the simpler approach is able to achieve comparable performance. However, for text they evaluate only one DR technique, use only one corpus, and evaluate distortion error rather than classification accuracy.

In contrast to the above work, we evaluate classification accuracy using five DR techniques on three distinct corpuses. In addition, we do not require that any class labels or word cluster information be provided to the DR algorithms.

6 Conclusions

This paper has examined how pre-processing with dimensionality reduction can improve text analysis, using classification accuracy to measure performance. Of the five DR techniques that we considered, all were able to achieve substantial improvements compared to not performing DR, under some conditions. However, MDS and the closely related Isomap proved to be the best overall performers.
These two techniques consistently out-performed the no-DR variants that we considered, and were able to achieve strong accuracy using just a small number of dimensions. MDS's ability to out-perform more recent and more complex techniques such as LDM was surprising. This suggests that, at least for the corpuses we considered, this textual data lies on a linear manifold for which the more complex non-linear techniques do not hold an advantage. In addition, we found that LDM, as well as older techniques such as PCA and LLE, had particular trouble with corpuses where the documents were smaller or belonged to less well-separated categories. Both MDS and Isomap, however, maintained good performance in these situations. Future work should verify these findings with additional data sets and other classifiers. Additionally, it would be worthwhile to more precisely characterize the corpuses for which MDS and Isomap hold an advantage.

The ability to reduce input documents while preserving significant inter-document relationships is important for several reasons. First, having fewer dimensions enables a classifier to function much more efficiently, improving both training and testing time, and also eliminates many noisy dimensions that could diminish accuracy. Second, informative low-dimension representations can be stored and transmitted more efficiently. Third, low-dimension representations enable higher-order processing to be more efficient. For instance, discovering previously unknown connections between documents can be accomplished more quickly on reduced data [14]. Furthermore, effective pre-processing with DR may even improve the quality of such analysis tasks by eliminating noise and by identifying distinct terms with related meaning, as with latent semantic analysis [6]. We intend to further explore such possibilities in future work.

Acknowledgments.
Thanks to the USNA Trident Scholar and ONR In-house Laboratory Independent Research Programs for supporting this work, and to Eric Hardisty and the anonymous reviewers for their helpful comments.

References

[1] M. Balasubramanian and E. L. Schwartz. The Isomap algorithm and topological stability. Science, 295:7a, Jan. 2002.
[2] E. Bingham and H. Mannila. Random projection in dimensionality reduction: applications to image and text data. In ACM Special Interest Group on Management of Data. ACM Press, 2001.
[3] K. W. Church and W. A. Gale. Inverse document frequencies (idf): A measure of deviations from Poisson. In Annual ACM Conference on Research and Development in Information Retrieval. ACM Press, 1995.
[4] R. R. Coifman, S. Lafon, A. B. Lee, M. Maggioni, B. Nadler, F. Warner, and S. W. Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. PNAS, 102(21):7426-7431, May 2005.
[5] T. Friedrich. Nonlinear dimensionality reduction - Locally Linear Embedding versus Isomap. Technical report, The University of Sheffield - Machine Learning Group, Sheffield S1 4DP, U.K., December 2004.
[6] M. D. Gordon and S. Dumais. Using latent semantic indexing for Literature Based Discovery. Journal of the American Society for Information Science, 49(8):674-685, 1998.
[7] B. W. Higgs, J. Weller, and J. L. Solka. Spectral embedding finds meaningful (relevant) structure in image and microarray data. BMC Bioinformatics, 7:74, February 2006.
[8] I. Jolliffe. Principal Component Analysis. Springer, 2nd edition, October 2002.
[9] G. Karypis and E.-H. Han. Fast dimensionality reduction algorithm with applications to document retrieval & categorization. In International Conference on Information and Knowledge Management, 2000.
[10] H. Kim, P. Howland, and H. Park. Dimension reduction in text classification with support vector machines. Journal of Machine Learning Research, 6:37-53, 2005.
[11] J. B. Kruskal and M. Wish. Multidimensional Scaling, chapters 1, 3, 5, pages 7-19, 48, 73. Sage Publications Inc., 1978.
[12] A. R. Martinez and E. J. Wegman. A text stream transformation for semantic-based clustering. Computing Science and Statistics, 34:184-203, 2002.
[13] T. Nasukawa and J. Yi. Sentiment analysis: Capturing favorability using natural language processing. In Int. Conference on Knowledge Capture, Sanibel Island, FL, Oct 2003.
[14] C. E. Priebe, D. J. Marchette, et al. Iterative denoising for cross-corpus discovery. COMPSTAT 2004 Symposium, 2004.
[15] L. K. Saul and S. T. Roweis. An introduction to locally linear embedding. Technical report, AT&T Labs Research and Gatsby Computational Neuroscience Unit, UCL, 2001.
[16] D. J. Solka, A. C. Bryant, and E. J. Wegman. Text data mining with minimal spanning trees. Handbook of Statistics on Data Mining and Visualization, 24, 2005.
[17] D. R. Swanson. Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspectives in Biology and Medicine, 30(1):7-18, 1986.
[18] G. Trunk. A problem of dimensionality: A simple example. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1(3):306-307, July 1979.
[19] L. van der Maaten, E. Postma, and H. van den Herik. Dimensionality reduction: A comparative review. Draft version, 2007.
[20] F. Wickelmaier. An introduction to MDS. Technical report, Aalborg University, Denmark, May 2003.
[21] Y. Yang and J. O. Pedersen. A comparative study on feature selection in text categorization. In International Conference on Machine Learning, 1997.