Arabic Documents Indexing and Classification Based on Latent Semantic Analysis and Self-Organizing Map

Chafic Mokbel, Hanna Greige
University of Balamand, P.O. Box 100, Tripoli, Lebanon
+961 6 930 250
chaficm@balamand.edu.lb, greige@balamand.edu.lb

Charles Sarraf
University of Balamand & Ericsson Lebanon, P.O. Box 55334 – Sin-el-Fil, Beirut, Lebanon
Charles.Sarraf@ericsson.com

Mikko Kurimo
Helsinki University of Technology, Neural Network Research Center, P.O. Box 5400, FIN-02015 HUT, Finland
mikkok@hut.fi

ABSTRACT
This paper describes an Arabic document indexing system based on a hybrid "Latent Semantic Analysis" (LSA) and "Self-Organizing Map" (SOM) algorithm. The approach has the advantage of being completely statistical and of inferring the indices automatically from the document database. A rule-based stemming method is also proposed for the Arabic language. The whole system has been tested on a database formed of the Alnahar newspaper articles of 1999. Document clustering and a few retrieval experiments have provided satisfactory results.

Keywords
Indexing, Latent Semantic Analysis, Self-Organizing Map, Average Document Perplexity, Precision measure.

1 INTRODUCTION
During the last decades networking technologies have been largely developed and deployed, and such development and expansion offer a reliable infrastructure for real information highways. However, accessing information is not so easy, and in many cases the information is not well structured. It is therefore of high interest to develop techniques that help organize the huge amount of information available over the networks.

In order to organize the documents, which are the units of information, indexing can be used. Indexing also allows easy retrieval and the possibility of automatic classification. Different statistical approaches and techniques have been proposed to automatically determine document indices from document databases, and important results have been achieved in this area during the last years. In this paper we propose to use the Latent Semantic Analysis (LSA) [2] approach to associate a stochastic index with every article in a database of the Alnahar newspaper 1999 articles. Every article in this database constitutes a document. LSA first generates a co-occurrence matrix of the words and the documents of the database, assuming that the semantic information about documents and words is contained in this co-occurrence matrix. On this matrix LSA applies a singular value decomposition (SVD) in order to reduce the space dimension while maintaining a maximum of information. Words and documents are represented by vectors in this reduced space, called the latent semantic space. As described above, LSA is a statistical approach that does not consider explicit linguistic or semantic information about words or documents, but tries to discover it from the available database. This idea of inferring semantic information from data is very attractive, but the task to accomplish is very complex. To reduce the complexity and to improve the modeling reliability, it might be helpful to add some a priori information concerning the words. Generally, and in order not to bias the measures, the most common words (e.g. articles) are taken out of the vocabulary, and stemming is performed, maintaining only the roots of the words. Stemming is of special interest for the Arabic language since all relative and possessive articles are joined to the words, making the number of variants of a word very large. Avoiding stemming increases the complexity of the LSA task since it requires LSA to discover the word roots in addition to the semantic associations. In this work an important effort has been devoted to stemming and to reducing the vocabulary.

As an alternative to LSA [4], or in association with it, the Self-Organizing Map (SOM) [3] is also investigated for Arabic document classification and indexing. In addition to its nonlinear classification ability, SOM offers a graphical visualization that facilitates the study and the interpretation of the results of indexing and classification of the documents. In this work, SOM is applied after LSA in order to compute a smoothed document index. If only the document's own words are considered while computing the index, this index is sensitive. Thus, using SOM, words from close documents are also used in computing a document class index, which reduces the sensitivity.

This paper is organized as follows. The LSA and SOM approaches are presented in section 2, as well as the combination of these two techniques. The reduction of the vocabulary size and the stemming applied are discussed in section 3. Measures to assess indexing techniques are given in section 4. Section 5 describes the database, the experimental protocol and the results. Finally, section 6 concludes and enumerates some perspectives.
2 LSA AND SOM FOR DOCUMENT INDEXING

Latent Semantic Analysis
Latent Semantic Analysis (LSA) is a statistical approach that was originally designed to improve the effectiveness of information retrieval by performing indexing based on the semantic content of words, as opposed to direct word matching [2]. For LSA, a document is defined as a set of words with no importance given to their order. A set of documents, called the document database, is formed from a set of words, called the vocabulary words. As a first-order approximation, only the relative occurrence of the words in the different documents is considered to represent the semantic information in the database. The whole system is then defined by the word-document co-occurrence matrix, in which each row represents a word and each column represents a document. The elements of the matrix contain the frequencies with which the words appear in the documents. Row vectors represent the different words, and two words are considered to be semantically close if the Euclidean distance between their vectors is relatively small. Indeed, words that appear in the same documents are semantically tightly related. In parallel, documents sharing the same words are also semantically related. Looking at this matrix, the documents appear as vectors in a high-dimensional space whose orthonormal axes represent the vocabulary words. The same reasoning holds in the other direction, where words are vectors in the document space. Consider the word vectors, i.e. the rows of the word-document co-occurrence matrix. If we multiply the co-occurrence matrix by its transpose, we obtain the correlation matrix of the words. This matrix indicates how much words are correlated, i.e. carry the same information.
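To make this construction concrete, the following minimal Python sketch (a hypothetical toy example, not the system described in this paper) builds a word-document co-occurrence matrix for a four-document corpus and the word correlation matrix obtained by multiplying it by its transpose.

```python
import numpy as np

# Hypothetical toy corpus: each document is a bag of (already stemmed) words.
docs = [
    ["match", "goal", "team"],
    ["team", "coach", "goal", "goal"],
    ["election", "vote", "party"],
    ["party", "vote", "government"],
]

# Vocabulary: one row per word, one column per document.
vocab = sorted({w for d in docs for w in d})
word_idx = {w: i for i, w in enumerate(vocab)}

# Word-document co-occurrence matrix A: A[i, j] = frequency of word i in document j.
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d:
        A[word_idx[w], j] += 1

# Word correlation matrix A.A^T: how much two words carry the same information.
word_corr = A @ A.T
print(A)
print(word_corr[word_idx["goal"], word_idx["team"]])   # words sharing documents correlate
```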
If an eigenvalue decomposition is performed on this correlation matrix, the most significant information axes can be chosen and the words can be projected into this informative subspace. Performing the eigenvalue decomposition of the correlation matrix is equivalent to performing the singular value decomposition (SVD) of the co-occurrence matrix. The most informative singular vectors are then retained, defining a latent semantic subspace. The neglected, less informative singular vectors are supposed to carry noise and redundant information due to style, confusion, etc. In the latent semantic space, the words are represented by vectors and every document is considered as the vector sum of its word vectors. The document vectors define the indices.

Hereafter, the different steps to compute the latent semantic space are provided. First, the word-document co-occurrence matrix, noted A, is computed. Next, a singular value decomposition (SVD) is applied to A, which can be decomposed into the product of three matrices:

A = U S V^T (1)

where U and V are the matrices of left and right singular vectors and S is the diagonal matrix of singular values, i.e. the non-negative square roots of the eigenvalues of A.A^T. The first columns of U and V define the orthonormalized eigenvectors associated with the non-zero eigenvalues of A.A^T and A^T.A respectively. By keeping only the n largest singular values the space can be further reduced, eliminating some of the noise due to style or non-informative words:

A_n = U_n S_n V_n^T (2)

In this n-dimensional space the i-th word w_i is encoded as:

x_i = u_i S_n / ||u_i S_n|| (3)

where u_i S_n is the i-th row of the matrix U_n S_n. In order to evaluate the semantic dissimilarity between words, a distance measure such as the simple Euclidean distance can be used. Using this distance, the document dissimilarity can also be measured. Actually, a document is seen as the set of its words and is thereby represented in the reduced latent semantic space by a vector equal to the sum of its word vectors. The document vector in the semantic space defines the document index. Based on the document and word indices, a simple Euclidean distance permits comparing a word to a document, or two documents to each other. A clustering algorithm or a search algorithm can be based on this distance. At this stage it is important to notice that the costly computation of the SVD can be done off-line, and the obtained indices can then be used on-line in any classification or search algorithm.

Let us discuss a main practical issue concerning the implementation of the LSA algorithm. SVD is the heart of the LSA approach, which is a statistical technique requiring huge databases for accurate estimation. Implementing the SVD for indexing on large databases is a hard task due to the high dimensions of the co-occurrence matrix; powerful general-purpose algorithms become impractical in this case. Fortunately, the main characteristic of the co-occurrence matrix is its sparseness. The density of non-zero elements in such a matrix is very low, since only a small percentage of the vocabulary words appears in any given document. The problem of computing the SVD for large sparse matrices has been addressed and some efficient methods have been proposed in the literature. For example, the block Lanczos iteration has been proposed to efficiently perform the SVD of sparse matrices [1]. This method has been used in our experiments.
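Continuing the toy example above, the sketch below illustrates equations (1)-(3) with scipy's sparse truncated SVD (standing in here for the block Lanczos routine of SVDPACKC used in the experiments); the variables A, docs and word_idx are assumed to come from the previous sketch.

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import svds

# A, docs and word_idx are assumed to come from the toy co-occurrence sketch in section 2.
n = 2                                       # dimension of the latent semantic subspace
U_n, s_n, Vt_n = svds(csc_matrix(A), k=n)   # truncated SVD: A_n = U_n S_n V_n^T

# Word encodings, equation (3): x_i = u_i S_n / ||u_i S_n||
word_vecs = U_n * s_n
word_vecs /= np.linalg.norm(word_vecs, axis=1, keepdims=True)

# Document index: sum of the vectors of the words the document contains.
doc_index = np.array([sum(word_vecs[word_idx[w]] for w in d) for d in docs])

# Semantic dissimilarity between two documents: Euclidean distance between their indices.
print(np.linalg.norm(doc_index[0] - doc_index[1]))
```

On the real database the co-occurrence matrix would be stored in a sparse format before the decomposition, as described in section 5.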
Combining LSA with SOM
After the SVD computation, every document in the database is associated with an index that can be evaluated by considering the words in that document. If only the words that appear in that document are used, the computation of the index is very sensitive. This sensitivity can be reduced if the neighboring documents are also taken into account in the computation of the index. This supposes a clustering of the documents in the latent semantic space. To achieve this clustering and the consequent smoothed indexing, Self-Organizing Maps (SOM) can be used after LSA. This has been proposed in [4] and experimented with in order to cluster the documents of a spoken audio document database. Using the SOM, an index can be found for every document [3] which integrates not only the document's words but also the words of the documents close to it in the latent semantic subspace.

3 STEMMING
As discussed above, efficient techniques [1] can be used to perform the SVD decomposition of the large sparse co-occurrence matrix. Even so, for huge databases the SVD computation remains very heavy. To illustrate this, we mention that the number of documents in the Alnahar 1999 database is above 40,000 and the number of different words included is greater than 400,000. Therefore a reduction of the vocabulary is necessary. Moreover, many variants of the same word exist in Arabic, since many pronouns, articles, possessives, etc. can be attached to words or verbs. Considering all the variants as separate words requires LSA to automatically discover the roots of the words in addition to their semantic relations. The simplest and most efficient way to obtain the roots of the different words is to use a digital dictionary delivering such information. In our work, a grammar was defined and applied in order to extract the roots of derived words. This grammar considers that variants of words may be derived from roots (of at least 3 characters) by prefixes and suffixes that can be added. Examples of these rules are given in the following (a simplified illustration of such prefix/suffix stripping is sketched at the end of this section):

Derived_al = "ال" . root
Derived_fal = "فال" . root
Derived_al_couple1 = "ال" . root . "ان"
Derived_al_couple2 = "ال" . root . "ين"

To implement the root extraction, the set of grammar stemming rules is represented in a graph, which largely facilitates the extraction. A part of this graph for nouns is presented in figure 1; a separate graph is used for verbs. However, in order to avoid exceptions and problems with special words, the root length has to be greater than 3 characters. Although this rule-based approach has disadvantages regarding exceptions and particular words, it has the advantage of being automatically and quickly applied. Moreover, a digital dictionary may not contain all the words found in a document database, in particular some proper names, and in such cases the only solution is based on such grammars. Finally, this approximate grammar-based solution does not cover all cases, e.g. the broken plurals ("جمع التكسير"), but those cases can be handled by the statistical LSA.

Figure 1: Part of the stemming graph for nouns (start and end nodes, the root, and prefix/suffix arcs such as "ال", "كال", "بال", "ان", "ين", "ون").

Once the stemming using both the noun and verb graphs is finished, less informative words are removed. Pronouns (e.g. I "أنا", you "أنت", "أنتم", "أنتن", he "هو", they "هم", "هن") are examples of such words. After stemming and removing the less informative words, a histogram of the remaining words in the database is computed, and the most and the least frequent words are taken out.
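As a rough illustration of the rule-based stemming described above, the sketch below strips one prefix and one suffix while keeping a root of at least three characters. The rule lists are small illustrative subsets and the test word is a made-up example; they do not reproduce the authors' full noun and verb grammars.

```python
# Illustrative subsets of the prefix and suffix rules; the real grammar is richer
# and is organised as separate graphs for nouns and for verbs.
PREFIXES = ["فال", "كال", "بال", "ال"]   # longer prefixes are tried first
SUFFIXES = ["ان", "ين", "ون"]

def stem(word: str, min_root: int = 3) -> str:
    """Strip at most one prefix and one suffix, keeping a root of >= min_root letters."""
    root = word
    for p in PREFIXES:
        if root.startswith(p) and len(root) - len(p) >= min_root:
            root = root[len(p):]
            break
    for s in SUFFIXES:
        if root.endswith(s) and len(root) - len(s) >= min_root:
            root = root[:-len(s)]
            break
    return root

# Hypothetical example: a word of the form "ال" + root + "ين" is reduced to its root.
print(stem("اللاعبين"))   # -> "لاعب"
```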
4 ASSESSING INDEXING TECHNIQUES
Assessing indexing techniques is a very difficult task. Generally, human experts are asked to prepare a set of test queries and to select from the database the documents relevant to each query. Afterwards, the test queries are passed to a retrieval algorithm and the relevance of the retrieved documents is measured. Two relative measures, the recall and the precision, are used to quantify the relevance of the retrieved documents. The recall measures the proportion of the relevant documents that are retrieved; a high recall means that most of the relevant documents have been retrieved. However, an algorithm retrieving all the documents would also obtain a high recall value. In order to balance the recall measure, the precision counts the proportion of relevant documents among the retrieved documents. As a single measurement, we can use the average R-precision, which is defined as the average precision over the top R retrieved documents. Applying these measures on a database requires a heavy preparation procedure to choose the test queries and the corresponding relevant documents.

In [4] we proposed the average document perplexity as a measure of the relevance of a retrieval algorithm. Perplexity measures have been largely developed to assess the quality of language models, especially in speech recognition, where the perplexity measures the predictive ability of a language model to determine the next word given its history. The average document perplexity is obtained by calculating the conditional probabilities of the words given the document using the LSA model. Actually, given the distance d(w, doc) between a word and a document, we can compute a probability density function that is a monotone decreasing function of this distance. Using this function, the average document perplexity can be computed as:

ADPP = exp{ -(1/N) Σ_doc (1/|doc|) Σ_(w in doc) log P(w|doc) } (4)

where N is the number of documents and |doc| is the number of words in document doc.

5 EXPERIMENTS

Database description
Experiments have been conducted on the Alnahar newspaper articles of 1999. This is an Arabic document database formed of 44,238 HTML files. These files are preprocessed to remove the HTML tags, the headers, the non-Arabic words and the punctuation. The headers contain the date, number, title and authors. After preprocessing, the database includes 423,440 different words, whose numbers of occurrences vary from 1 to 714,342. After the different stemming steps, the number of words is reduced to 31,798. Actually, the fixed stemming based on the grammars defined in section 3, together with the removal of the less informative words, reduces the number of words to 31,798. Only the first 30,000 documents are considered in our experiments. This leads to a co-occurrence matrix of dimensions (31,798 x 30,000).

LSA and SOM
The co-occurrence matrix contains 6,799,061 non-zero values, which gives a density of 0.7% and 227 words per document on average. The matrix is encoded in the Harwell-Boeing sparse format. The block Lanczos SVD from SVDPACKC [1] is applied to compute the latent semantic subspace. Once the word vectors are defined in the latent semantic subspace, the document index vectors are calculated by summing the vectors of the included words. As indicated in section 2, SOM is applied on the document indices to perform a clustering. This clustering permits computing new smoothed indices that consider, for every document, in addition to its own words, the words of the close documents (a minimal sketch of this smoothing step is given at the end of this section). As mentioned above, SOM also has the advantage of enabling a visualization of the document classification or of the document distances to a test query. Figure 2 shows such an answer to the test query "Prost ااثر"; the color of a cell indicates how close a document cluster is to the query.

Figure 2: Document clusters and distances to the query "Prost ااثر".
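As a rough illustration of the SOM smoothing step, the sketch below trains a small self-organizing map on stand-in LSA document indices and replaces each document index by the prototype of its best-matching cell, so that close documents share a smoothed index. This is a minimal from-scratch SOM with arbitrary map size and training schedule, not the implementation or settings used in the experiments.

```python
import numpy as np

def som_smooth(doc_index, grid=(8, 8), epochs=20, lr0=0.5, sigma0=2.0, seed=0):
    """Train a small SOM on document index vectors and return smoothed indices."""
    rng = np.random.default_rng(seed)
    h, w = grid
    dim = doc_index.shape[1]
    # Cell prototypes, initialised from randomly chosen documents.
    proto = doc_index[rng.integers(0, len(doc_index), size=h * w)].reshape(h, w, dim).astype(float)
    # Grid coordinates of the cells, used by the neighbourhood function.
    coords = np.stack(np.meshgrid(np.arange(h), np.arange(w), indexing="ij"), axis=-1)
    total, step = epochs * len(doc_index), 0
    for _ in range(epochs):
        for x in doc_index[rng.permutation(len(doc_index))]:
            frac = step / total
            lr, sigma = lr0 * (1 - frac), sigma0 * (1 - frac) + 0.5
            # Best-matching unit: the cell whose prototype is closest to the document.
            d = np.linalg.norm(proto - x, axis=-1)
            bmu = np.array(np.unravel_index(np.argmin(d), d.shape))
            # Gaussian neighbourhood on the map pulls nearby prototypes toward x.
            g = np.exp(-np.sum((coords - bmu) ** 2, axis=-1) / (2.0 * sigma ** 2))
            proto += lr * g[..., None] * (x - proto)
            step += 1
    # Smoothed index of a document = prototype of its best-matching cell.
    bmus = [np.unravel_index(np.argmin(np.linalg.norm(proto - x, axis=-1)), grid)
            for x in doc_index]
    return np.array([proto[i, j] for i, j in bmus])

rng = np.random.default_rng(0)
doc_index = rng.normal(size=(200, 20))   # stand-in for the LSA document index vectors
smoothed = som_smooth(doc_index, grid=(6, 6))
print(smoothed.shape)                    # (200, 20): one smoothed index per document
```

The map cells are also what can be colored according to their distance to a query, as in figure 2.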
6 CONCLUSIONS AND PERSPECTIVES
In this paper we describe the work done in building a document indexing system for Arabic document clustering and retrieval. This system is based on a combination of the LSA and SOM techniques as described in [4]. Specific stemming and word root finding approaches have been developed. First experiments in retrieval have shown satisfactory results. As a perspective, we propose to evaluate the system using both the R-precision and the average document perplexity presented in section 4; some effort must be made to build a test query subset in order to apply the R-precision measure. Another perspective of this work is to combine more semantic and linguistic information with the LSA method. The main idea is to add some known tags to the dimensions describing the document vectors in the co-occurrence matrix.

ACKNOWLEDGEMENTS
The authors want to thank the Alnahar journal for permitting to run experiments on the 1999 database. Special thanks to Dr. M. Najjar for supporting this research project.

REFERENCES
1. Berry M.W., "Large-scale sparse singular value computations," Int. J. Supercomp. Appl., Vol. 6, No. 1, pp. 13-49, 1992.
2. Deerwester S., Dumais S., Furnas G. and Landauer K., "Indexing by latent semantic analysis," Journal of the American Society for Information Science, Vol. 41, pp. 391-407, 1990.
3. Kohonen T., "Self-Organizing Maps," Springer, Berlin, 2nd extended edition, 1997.
4. Kurimo M. and Mokbel C., "Latent semantic indexing by self-organizing map," Proc. ESCA ETRW Workshop on Accessing Information in Spoken Audio, pp. 25-30, 1999.
5. Renals S., Abberley D., Cook G. and Robinson T., "THISL spoken document retrieval," Proc. Text Retrieval Conference (TREC-7), 1998.
6. Ritter H. and Kohonen T., "Self-organizing semantic maps," Biol. Cyb., Vol. 61, No. 4, pp. 241-254, 1989.