Arabic Documents Indexing and Classification Based on Latent Semantic Analysis and Self-Organizing Map
Chafic Mokbel, Hanna Greige
University Of Balamand
P.O. Box 100 Tripoli
Lebanon
+961 6 930 250
chaficm@balamand.edu.lb
greige@balamand.edu.lb
Charles Sarraf
University Of Balamand & Ericsson Lebanon
P.O. Box 55334 – Sin-el-Fil
Beirut - Lebanon
Charles.Sarraf@ericsson.com

Mikko Kurimo
Helsinki University of Technology
Neural Network Research Center
P.O. Box 5400 FIN-02015 HUT
Finland
mikkok@hut.fi
ABSTRACT
This paper describes an Arabic document indexing system based on a hybrid "Latent Semantic Analysis" (LSA) and "Self-Organizing Map" (SOM) algorithm. The approach has the advantage of being entirely statistical and of automatically inferring the indices from the document database. A rule-based stemming method is also proposed for the Arabic language. The whole system has been tested on a database formed of the Alnahar newspaper articles for 1999. Document clustering and a few retrieval experiments have provided satisfactory results.

Keywords
Indexing, Latent Semantic Analysis, Self-Organizing Map, Average Document Perplexity, Precision measure.

1 INTRODUCTION
During the last decades networking technologies have been widely developed and deployed, and this development and expansion offer a reliable infrastructure for real information highways. However, accessing information is not always easy and in many cases the information is not well structured. It is therefore of high interest to develop techniques that help organize the huge amount of information available over the networks.

In order to organize the documents, which are the units of information, indexing can be used. Indexing also allows easy retrieval and the possibility of automatic classification. Different statistical approaches and techniques have been proposed to automatically determine document indices from document databases, and important results have been achieved in this area during the last years.

In this paper we propose to use the Latent Semantic Analysis (LSA) [2] approach to associate a stochastic index with every article in a database of the Alnahar newspaper 1999 articles. Every article in this database constitutes a document. LSA first generates a co-occurrence matrix of the words and the database documents. It assumes that the semantic information of the documents and words is contained in this co-occurrence matrix. On this matrix LSA applies a singular value decomposition (SVD) in order to reduce the space dimension while maintaining a maximum of information. Words and documents are represented by vectors in this reduced space, called the latent semantic space. As described above, LSA is a statistical approach that does not consider explicit linguistic or semantic information about words or documents, but tries to discover it from the available database. This idea of inferring semantic information from data is very attractive, but the task to accomplish is very complex. To reduce the complexity and to improve the modeling reliability, it might be helpful to add some a priori information concerning the words. Generally, and in order not to bias the measures, the most common words (e.g. articles) are removed from the vocabulary, and stemming is performed, keeping only the roots of the words. Stemming is of special interest for the Arabic language since all relative and possessive articles are joined to the words, making the number of variants of a word very large. Avoiding stemming increases the complexity of the LSA task since it requires LSA to discover the word roots in addition to the semantic associations. In this work an important effort has been devoted to stemming and to the reduction of the vocabulary.

As an alternative or in association to LSA [4], the Self-Organizing Map (SOM) [3] is also experimented with for Arabic document classification and indexing. In addition to its nonlinear classification ability, SOM offers a graphical visualization that facilitates the study and the interpretation of the results of indexing and classification of the documents. In this work, SOM is applied after LSA in order to compute a smoothed document index. If only the document words are considered while computing the index, this index will be sensitive. Thus, using SOM, words from close documents are also used in computing a document class index, which reduces this sensitivity.

This paper is organized as follows. The LSA and SOM approaches are presented in section 2, as well as the combination of these two techniques. The reduction of the vocabulary size and the stemming applied are discussed in section 3. Measures to assess indexing techniques are given in section 4. Section 5 describes the database, the experimental protocol and the results. Finally, section 6 concludes and enumerates some perspectives.
2 LSA AND SOM FOR DOCUMENT INDEXING
Latent Semantic Analysis
Latent Semantic Analysis (LSA) is a statistical approach
that was originally designed to improve the effectiveness of
information retrieval by performing indexing based on
the semantic content of words as opposed to direct word matching [2]. For LSA, a document is defined as a set of words, with no importance given to their order. A set of
documents, called the document database, is formed from a
set of words, called the vocabulary words. As a first order
approximation only the relative occurrence of the words in
the different documents is considered to represent the
semantic information in the database. The whole system is
then defined with the word-document co-occurrence
matrix, in which each row represents a word and each
column represents a document. The elements of the matrix
contain the frequencies with which the words appear in the
documents. Row vectors represent the different words, and
two words are considered to be close semantically if the
Euclidean distance between their vectors is relatively small.
Actually, the words that appear in the same documents are
semantically tightly related. In parallel, documents with the
same words would also be semantically related.
Looking at this matrix, the documents appear as vectors in
a high-dimensional space where the orthonormal axes
represent the vocabulary words. The same reasoning holds in the other direction, where words are vectors in the document space. Consider the word vectors, i.e. the rows
in the word-document co-occurrence matrix. If we multiply
the co-occurrence matrix by its transpose, we obtain the
correlation matrix of the words. This matrix indicates how strongly the words are correlated, or carry the same information.
If an eigenvalue decomposition is performed on this matrix,
the most significant information axes can be chosen and
words can be projected in this informative subspace.
Performing the eigenvalue decomposition of the correlation
matrix is equivalent to performing the singular value
decomposition (SVD) of the co-occurrence matrix. The
most informative singular vectors are then kept, defining a latent semantic subspace. The neglected, less informative singular vectors are supposed to carry noise and redundant information due to style, confusion, etc. In the
latent semantic space, the words are represented by vectors
and every document is considered as the vector sum of its word vectors. The document vectors define the indices. Hereafter, the different steps used to compute the latent semantic space are given.
First, the word-document co-occurrence matrix, noted A, is computed. Next, a singular value decomposition (SVD) is applied to A, which can be decomposed into the product of three matrices:

A = U S V^T    (1)

where U and V are the matrices of left and right singular vectors and S is the diagonal matrix of singular values, i.e. the non-negative square roots of the eigenvalues of A.A^T. The first columns of U and V define the orthonormalized eigenvectors associated with the non-zero eigenvalues of A.A^T and A^T.A respectively. By keeping only the n largest singular values the space can be further reduced, eliminating some of the noise due to style or non-informative words:

A_n = U_n S_n V_n^T    (2)

In this n-dimensional space the i-th word w_i is encoded as:

x_i = u_i S_n / ||u_i S_n||    (3)

where u_i S_n is the i-th row of the matrix U_n S_n.
In order to evaluate the semantic dissimilarity between
words, a distance measure such as the simple Euclidean
distance can be used. Using this distance, the document
dissimilarity can be measured. Actually, a document is seen as the set of its included words and is thereby represented in the reduced latent semantic space by a vector equal to the sum of its word vectors. The document vector in the semantic space defines the document index. Based on the document and word indices, a simple Euclidean distance makes it possible to compare a word with a document, or two documents with each other. A clustering algorithm or a search algorithm can be based on this distance.
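To make these steps concrete, the following Python sketch (illustrative only, with toy documents and an arbitrary dimension n; it is not the code used in this work) builds a small word-document co-occurrence matrix, computes a truncated SVD with a Lanczos-type sparse solver, encodes the words according to equation (3), and derives document indices as sums of word vectors.

```python
# Illustrative LSA indexing sketch (toy data; not the code used in this work).
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import svds

docs = [["economy", "bank", "growth"],      # pre-stemmed toy documents
        ["bank", "loan", "interest"],
        ["football", "match", "goal"]]
vocab = sorted({w for d in docs for w in d})
word_id = {w: i for i, w in enumerate(vocab)}

# Word-document co-occurrence matrix A (rows: words, columns: documents).
A = lil_matrix((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d:
        A[word_id[w], j] += 1.0

# Truncated SVD A ~ U_n S_n V_n^T keeping the n largest singular values
# (svds uses a Lanczos-type solver suited to sparse matrices).
n = 2
U, s, Vt = svds(A.tocsc(), k=n)

# Word vectors x_i = u_i S_n / ||u_i S_n|| (equation 3).
US = U * s                                  # scale the columns of U by the singular values
word_vecs = US / np.linalg.norm(US, axis=1, keepdims=True)

# Document index = sum of the vectors of its words.
def document_index(words):
    return np.sum([word_vecs[word_id[w]] for w in words if w in word_id], axis=0)

doc_indices = np.array([document_index(d) for d in docs])

# A query is treated as a short document and compared by Euclidean distance.
query = ["bank", "interest"]
dists = np.linalg.norm(doc_indices - document_index(query), axis=1)
print("closest document:", int(np.argmin(dists)))
```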
At this stage it is important to notice that the costly
computation of the SVD can be done off-line and the
obtained indices can be used on-line in any classification or
search algorithms. Let us discuss a main practical issue
concerning the implementation of the LSA algorithm. SVD
is the heart of the LSA approach, which is a statistical technique requiring huge databases for accurate estimation. Implementing the SVD for indexing on large databases is a hard task due to the high dimensions of the co-occurrence matrix; powerful general-purpose algorithms become impractical in this case. Fortunately, the main characteristic of the co-occurrence matrix is its sparseness. The density of non-null elements in such a matrix is very low, since only a small percentage of the vocabulary words appears in a given document. The problem of computing the SVD for large sparse matrices has been addressed and some efficient methods have been proposed in the literature. For example, the block Lanczos iteration has been proposed to efficiently perform SVD on sparse matrices [1]. This method has been used in our experiments.
Combining LSA with SOM
After the SVD computation, an index is associated with every document in the database, and it can be evaluated by considering the words in that document. If only the words that appear in that document are used, the computation of the index is very sensitive. This sensitivity can be avoided if the neighboring documents are also taken into account in the computation of the index. This supposes a clustering of the documents in the latent semantic space. To achieve this clustering and the resulting smoothed indexing, Self-Organizing Maps (SOM) can be used after LSA. This has been proposed in [4] and experimented with in order to cluster the documents in a spoken audio document database. Using the SOM, an index can be found for every document [3] that integrates not only the document words but also the words of the close documents in the latent semantic subspace.
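As an illustration of this smoothing step, the sketch below trains a SOM on the LSA document indices using the third-party MiniSom package and replaces each document index by the prototype vector of its best-matching map unit, so that documents falling in the same map region share a smoothed index. This is an assumption-laden example, not the authors' implementation; the map size and training parameters are arbitrary.

```python
# Sketch of the SOM smoothing step using the third-party MiniSom package
# (for illustration only; map size and training parameters are arbitrary).
import numpy as np
from minisom import MiniSom

doc_indices = np.random.rand(1000, 50)      # placeholder for the LSA document indices

som = MiniSom(10, 10, doc_indices.shape[1], sigma=1.5, learning_rate=0.5)
som.random_weights_init(doc_indices)
som.train_random(doc_indices, num_iteration=10000)

weights = som.get_weights()                 # prototype vectors, shape (10, 10, dim)

# Smoothed index: the prototype of the document's best-matching map unit, so that
# documents mapped to the same unit (i.e. close documents) share an index.
smoothed_indices = np.array([weights[som.winner(v)] for v in doc_indices])

# The map coordinates of each document can also be used for the kind of
# visualization mentioned in section 5 (distances of document clusters to a query).
clusters = [som.winner(v) for v in doc_indices]
```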
3 STEMMING
As discussed above, efficient techniques [1] can be used to perform the SVD decomposition of the large sparse co-occurrence matrix. Even so, for huge databases, the SVD computation remains very heavy. To illustrate this, we mention that the number of documents in the Alnahar 1999 database is above 40,000 and the number of different words included is greater than 400,000. Therefore a reduction of the vocabulary is necessary. Moreover, many variants of the same word exist in Arabic since many pronouns, articles, possessives, etc. can be attached to words or verbs. Considering all the variants as separate words requires LSA to automatically discover the root of the words in addition to their semantic relations.
The simplest and most efficient way to obtain the root of the different words is to use a digital dictionary delivering such information. In our work, a grammar was defined and applied in order to extract the roots of derived words. This grammar considers that variants of words may be derived from roots (of at least 3 characters) by adding possible prefixes and suffixes. Examples of these rules are given in the following:
Derived_al = “‫ ”ال‬. root
Derived_fal = “‫ ”فال‬. root
Derived_al_couple1 = “‫ ”ال‬. root . “‫”ان‬
Derived_al_couple2 = “‫ ”ال‬. root . “‫”ين‬
To implement the root extraction, the set of grammar stemming rules is represented in a graph. This graph representation greatly facilitates the root extraction. A part of this graph for nouns is presented in Figure 1; a separate graph is used for verbs. However, in order to avoid problems with exceptions and special words, the root length has to be greater than 3 characters.
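A minimal sketch of such a rule-based stemmer is given below. It only encodes the few example prefixes and suffixes quoted above and enforces the minimum root length; the real grammar, with its separate noun and verb graphs, is much larger.

```python
# Minimal sketch of the rule-based stemmer: strip one known prefix and/or suffix
# as long as the remaining root keeps the minimum length. Only the example
# affixes quoted in the text are listed here.
PREFIXES = ["فال", "كال", "ال"]   # longer prefixes tried first
SUFFIXES = ["ان", "ين", "ون"]
MIN_ROOT = 3                       # minimum number of characters kept in the root

def stem(word):
    root = word
    for p in PREFIXES:
        if root.startswith(p) and len(root) - len(p) >= MIN_ROOT:
            root = root[len(p):]
            break
    for s in SUFFIXES:
        if root.endswith(s) and len(root) - len(s) >= MIN_ROOT:
            root = root[:-len(s)]
            break
    return root

print(stem("الكتابان"))            # the prefix "ال" and suffix "ان" are stripped
```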
Although this rule-based approach has disadvantages regarding exceptions and particular words, it has the advantage of being applied automatically and quickly. Moreover, a digital dictionary may not contain all the words found in a document database, in particular some proper names, and in such cases the only solution is to rely on such grammars. Finally, this approximate grammar-based solution does not cover all cases, e.g. broken plurals ("جمع التكسير"), but those cases can be handled by the statistical LSA.
Figure 1: Part of the stemming graph.
Once the stemming using both the noun and verb graphs is completed, the less informative words are removed. Pronouns (e.g., I "انا", you "انت", "انتم", "انتن", he "هو", they "هم", "هن") are examples of such words. After stemming and removing the less informative words, a histogram of the remaining words in the database is computed, and the most frequent and the least frequent words are taken out.
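This vocabulary reduction step can be summarized by the short sketch below; the frequency thresholds are placeholders, not the values actually used in the experiments.

```python
# Sketch of the vocabulary reduction: count the remaining (stemmed) words over
# the whole database and drop the most and the least frequent ones.
from collections import Counter

def prune_vocabulary(stemmed_docs, min_count=2, max_count=100000):
    counts = Counter(w for doc in stemmed_docs for w in doc)
    kept = {w for w, c in counts.items() if min_count <= c <= max_count}
    return [[w for w in doc if w in kept] for doc in stemmed_docs], kept
```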
4 ASSESSING INDEXING TECHNIQUES
Assessing indexing techniques is a very difficult task.
Generally, human experts are asked to prepare a set of test queries and to select from the database the documents relevant to each query. Afterwards, the test queries are
passed to a retrieval algorithm and the relevance of
retrieved documents is measured. Two relative measures,
the recall and the precision, are used to quantify the
relevance of the retrieved documents. The recall measures
the proportion of the relevant documents obtained. High
recall means that a maximum of relevant documents have
been retrieved. However, an algorithm retrieving all the
documents will provide a high recall value. In order to
balance the recall measure, the precision counts the
proportion of relevant documents in the retrieved
documents. As a single measurement, we can use the average R-precision, which is defined as the average precision over the top R retrieved documents. Applying these measures to a database requires a heavy preparation procedure to choose the test queries and the corresponding relevant documents.
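For reference, the three measures discussed above can be written as in the straightforward sketch below, where the retrieved and relevant document lists for one query come from the evaluation protocol just described.

```python
# Straightforward sketch of the evaluation measures for a single test query.
def recall(retrieved, relevant):
    return len(set(retrieved) & set(relevant)) / len(relevant)

def precision(retrieved, relevant):
    return len(set(retrieved) & set(relevant)) / len(retrieved)

def r_precision(ranked_retrieved, relevant):
    R = len(relevant)                      # precision over the top R retrieved documents
    return len(set(ranked_retrieved[:R]) & set(relevant)) / R

# The average R-precision of a system is the mean of r_precision over all test queries.
```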
In [4] we proposed the average document perplexity as a measure of the relevance of a retrieval algorithm. Perplexity measures have been widely used to measure the quality of language models, especially in speech recognition. In that case, the perplexity measures the predictive ability of a language model to determine the next word given its history. The average document perplexity is obtained by calculating the conditional probabilities of the words given the document using the LSA model. Actually, given the distance d(w,doc) between a word and the document, we can compute a probability that is a monotonically decreasing function of this distance. Using this probability, the average document perplexity can be computed as:
ADPP = exp{ -(1/N) Σ_doc Σ_(w ∈ doc) log P(w | doc) }    (4)

where N is the total number of words counted over all the documents.
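The sketch below shows one way to evaluate equation (4). Since the text only requires P(w|doc) to decrease with the distance d(w,doc), the exponential form normalized over the vocabulary used here is an assumption made for illustration, not necessarily the function used in [4].

```python
# One possible way to evaluate equation (4): P(w|doc) is obtained here by
# normalizing exp(-d(w, doc)) over the vocabulary; any monotonically decreasing
# function of the distance could be used instead.
import numpy as np

def word_probabilities(doc_index, word_vecs):
    d = np.linalg.norm(word_vecs - doc_index, axis=1)   # d(w, doc) for every vocabulary word
    p = np.exp(-d)
    return p / p.sum()

def average_document_perplexity(doc_word_ids, doc_indices, word_vecs):
    # doc_word_ids: for each document, the list of vocabulary ids of its words
    total_log_p, n_words = 0.0, 0
    for word_ids, idx in zip(doc_word_ids, doc_indices):
        p = word_probabilities(idx, word_vecs)
        total_log_p += np.log(p[word_ids]).sum()         # sum of log P(w|doc) over the document
        n_words += len(word_ids)
    return np.exp(-total_log_p / n_words)
```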
5 EXPERIMENTS
Database description
Experiments have been conducted on the Alnahar newspaper articles of 1999. This is an Arabic document database formed of 44,238 HTML files. These files are preprocessed to remove the HTML tags, the headers, the non-Arabic words and the punctuation. The headers contain the date, number, title, and authors. After preprocessing, the database includes 423,440 different words, whose occurrence counts vary from 1 to 714,342. The rule-based stemming defined in section 3 and the removal of the less informative words reduce the number of words to 31,798. Only the first 30,000 documents are considered in our experiments. This leads to a co-occurrence matrix of dimensions (31,798 × 30,000).
LSA and SOM
The co-occurrence matrix contains 6,799,061 non-zero values, which gives a density of 0.7% and 227 words per document on average. The matrix is encoded using the Harwell-Boeing sparse matrix format. The block Lanczos SVD from SVDPACKC [1] is applied to compute the latent semantic subspace. Once the word vectors are defined in the latent semantic subspace, the document index vectors are calculated by summing the vectors of the included words. As indicated in section 2, SOM is applied to the document indices to perform a clustering. This clustering makes it possible to compute new smoothed indices that consider, for every document, the words of the close documents in addition to its own words. As mentioned above, SOM also has the advantage of allowing a visualization of the document classification or of the document distances to a test query. Figure 2 shows such an answer to the test query "Prost ااثر". In this figure, the color of a cell indicates how close a document cluster is to the query.

Figure 2: Document clusters and distances to the query "Prost ااثر".
6 CONCLUSIONS AND PERSPECTIVES
In this paper we describe the work done in building a document indexing system for Arabic document clustering and retrieval. The system is based on a combination of the LSA and SOM techniques as described in [4]. Specific stemming and word root finding approaches have been developed. First experiments in retrieval have shown satisfactory results.
As a perspective, we propose to evaluate the system using both the R-precision and the average document perplexity proposed in section 4. Some effort must be devoted to building a test query set in order to apply the R-precision measure.
Another perspective of this work is to combine more
semantic and linguistic information with the LSA method.
The main idea is to add some known tags in the dimensions
describing the document vectors in the co-occurrence
matrix.
ACKNOWLEDGEMENTS
The authors wish to thank the Alnahar newspaper for permitting us to run experiments on the 1999 database. Special thanks to Dr. M. Najjar for supporting this research project.
REFERENCES
1. Berry M.W., "Large-scale sparse singular value computations," Int. J. Supercomp. Appl., Vol. 6, No. 1, pp. 13-49, 1992.
2. Deerwester S., Dumais S., Furnas G. and Landauer T., "Indexing by latent semantic analysis," Journal of the American Society for Information Science, Vol. 41, pp. 391-407, 1990.
3. Kohonen T., "Self-Organizing Maps," Springer, Berlin, 2nd extended edition, 1997.
4. Kurimo M. and Mokbel C., "Latent semantic indexing by self-organizing map," Proc. ESCA ETRW Workshop on Accessing Information in Spoken Audio, pp. 25-30, 1999.
5. Renals S., Abberley D., Cook G. and Robinson T., "THISL spoken document retrieval," Proc. Text Retrieval Conference (TREC-7), 1998.
6. Ritter H. and Kohonen T., "Self-organizing semantic maps," Biol. Cybern., Vol. 61, No. 4, pp. 241-254, 1989.