Latent Semantic Indexing and its place in Information Retrieval
By Michael Weller, Autumn 2003
1. Abstract
This report investigates Latent Semantic Indexing and its place in Information
Retrieval. It provides an overview of the field of Information Retrieval, and its overall
purpose. The report describes current tools that are available for use in helping with
retrieving information on the Internet and quality measurement techniques used to
determine how successful a technique is at retrieving documents.
The report focuses on Latent Semantic Indexing, explaining what it is, its purpose, the
approach it takes and describes potential uses. It contrasts Latent Semantic Indexing
with other tools available and outlines the benefits and problems associated with the use
of LSI. The report also raises questions regarding the implementation of Latent
Semantic Indexing and how it could be improved.
2. Contents
Latent Semantic Indexing and its place in Information Retrieval
1. Abstract
2. Contents
2.1. Table of Figures
3. Discussion Notes
3.1. Adding Meaning to Latent Semantic Indexing
3.2. Questions for Discussion
4. Introduction
5. Information Retrieval
5.1. Quality Measurement: Recall, Precision, and Fallout
5.2. Tools Available for Information Retrieval
5.2.1. Binary Matching
5.2.2. Vector Space Model
5.2.3. Latent Semantic Indexing
6. Latent Semantic Indexing
6.1. How does LSI work?
6.1.1. Content Search
6.1.2. Index Matrix Composition
6.1.2.1. Log-Entropy Weighting
6.1.2.2. Other Types of Weighting
6.1.3. Term Space Modelling
6.1.4. Singular Value Decomposition
6.2. Strengths of LSI
6.3. Problems with LSI
6.4. The Uses and Potential Uses of LSI
6.4.1. Relevance Feedback
6.4.2. Information Filtering
6.4.3. Textual Coherence
6.4.4. Automated Writing Assessment
6.4.5. Cross-Language Retrieval
7. Conclusions
8. References
9. Bibliography
2.1. Table of Figures
Figure 1 - Comparing Document Retrieval
Figure 2 - Ideal Retrieval Scenario
Figure 3 - Worst Case Retrieval Scenario
Figure 4 - Representing Document Space
Figure 5 - Simplified Vector Representation of the Index Matrix in Figure 4
Figure 6 - Vector Matching
Figure 7 - Simplified Comparison of Term Space Modelling Diagrams
Figure 8 - Pictorial Representation of SVD
3. Discussion Notes
Latent Semantic Indexing has been found to work surprisingly well at finding more relevant documents than keyword searching; it even finds documents that do not contain the specified keyword. However, it only uses pattern matching based upon the frequency of content-describing words across a large collection of documents.
Given this success, one question arises: can Latent Semantic Indexing be improved outside its patented framework to increase the relevance of the returned documents? One possible approach is to expand Latent Semantic Indexing to involve meaning and context.
3.1. Adding Meaning to Latent Semantic Indexing
Although Latent Semantic Indexing is a great improvement over current search technologies, it lacks the ability to determine whether a document is correct according to the context and meaning intended by its creator. Involving this concept is not going to be easy, especially as years of research within the field of Natural Language Processing have yet to prove fruitful.
It would be impractical for a search system to examine the meaning of every document in its collection. This would not only be difficult from a processing point of view, but the task of getting a machine to understand meaning has not yet been solved, and is unlikely to be for many years to come without a major breakthrough. A more feasible approach would be to examine each document’s abstract or summary. This would require less processing power, and provides a general overview of the content of the entire document. However, this overview would need to be constructed in a way that makes its content unambiguous from a machine’s point of view, depicting the appropriate context.
For example, searching for a document which contains an image of a Jaguar aircraft through the query ‘image of a Jaguar’ would return any documents related to the Jaguar aircraft, the Jaguar car, the manufacturer or the cat that also contain the word ‘image’. If the system were able to search through the abstract of each document to determine which of the different contexts was appropriate, the search could then be narrowed down. The system would then be able to prompt the user to clarify the search based upon the available options, and provide only relevant documents.
For this to be at all feasible, a new standard for publishing information on the Internet would be required. All documents would need a summary or abstract provided within their metadata, along with the main keywords. These keywords could be used to provide an alternative method of collating the documents.
By providing a context-sensitive description of the document, the system could
then verify that all documents found by its search are totally relevant. This could
be implemented using some form of pattern matching between the user’s search
query and the metadata of the documents in the initial search results.
3.2. Questions for Discussion
• Would updating web standards help in improving Latent Semantic Indexing and Information Retrieval?
• Is Latent Semantic Indexing the way forward, or just a temporary solution until Natural Language Processing is perfected?
• Who should manage the Index Matrix?
• There are a variety of possible implementations of Latent Semantic Indexing, ranging from distributed to centralised systems. Which would provide the most benefit?
4. Introduction
In the past, when searching for information on a particular topic, the best place to start
looking was the library. Here, the researcher would be able to browse through the
books in a particular section and find what was required, primarily by looking at the
contents and the index.
Whilst the library is still available, a far wider collection of sources is now accessible thanks to the development of the Internet and the growing number of users connected to it. It is all very well having access to all of these resources, but this access is useless unless the relevant resources can be easily found.
The best-known method of finding resources on the Internet is the use of search engines, such as Google (www.google.com), AltaVista (www.altavista.co.uk) and Yahoo (www.yahoo.co.uk). Search engines are only one method, however; there are also electronic library catalogues and “the grep string-matching tool in Unix” (Kolda, T et al 1998, p.1).
With the Internet being so vast, it is essential for users to be able to find information easily and efficiently. This becomes even more important for users who still rely on dial-up Internet connections, where the user is charged the price of a telephone call. Information Retrieval encapsulates this task, and from it Latent Semantic Indexing (LSI) has emerged as a promising new tool.
5. Information Retrieval
Information Retrieval is oriented around the indexing of documents, most of which are text-based, but which may also contain images and other multimedia features, such as video and sound, or may even be bibliographic data relating to non-electronic material.
Information Retrieval has existed since the 1950s and 1960s, when the abilities of computers began to show the potential for library automation.
The main focus of Information Retrieval is the “issues of how to find meaningful
index keywords” (The Everything Development Company 2000, p.1), with the aim of
being able to organise them, allowing a user to use simple queries to find relevant
information. These queries should then be able to return documents (hits) that are
relevant to the search, even including those that the user would not have thought of
looking at.
In Information Retrieval, there are two main quality measures used to assess the performance of a system: recall and precision.
5.1. Quality Measurement: Recall, Precision, and Fallout
When looking for documents on the Internet, the ultimate goal is to retrieve only those documents that are relevant to the topic being searched. To measure how well a retrieval system has performed, three factors need to be considered: the number of relevant documents (R), the number of relevant hits (RH) and the number of hits (H).
Figure 1 - Comparing Document Retrieval
Recall compares the number of relevant hits to the number of relevant documents.
This is done using the formula:
Recall = RH / R
where the ideal result is where all the relevant documents are found, producing a
Recall value of 1.
Precision on the other hand compares the number of relevant hits to the number of
hits. This can be done using the formula:
Precision = RH / H
where the ideal result is where all of the hits returned are relevant, producing a
Precision value of 1.
Although Recall and Precision are the main quality measures, there is one other which is not as well known: Fallout.
Fallout is the comparison of the number of Irrelevant Documents (ID) to the
number of hits. This can be done using the formula:
Fallout = ID / H
where the ideal result is where none of the hits returned are irrelevant, producing a Fallout value of 0.
Figure 2 - Ideal Retrieval Scenario
The best Information Retrieval system will be the one that has a Recall value of 1, a Precision value of 1 and a Fallout value of 0, returning all relevant documents and no irrelevant ones (See Figure 2).
Figure 3 - Worst Case Retrieval Scenario
The worst case scenario is where the search provides hits which are completely
irrelevant (See Figure 3).
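As a minimal sketch, all three measures can be computed directly from the set of relevant documents and the set of returned hits. The function and variable names below are illustrative, not taken from any cited system, and Fallout follows this report’s definition (ID / H):

def retrieval_metrics(relevant: set, hits: set):
    """Compute recall, precision and fallout for a single query.

    relevant -- the set of documents relevant to the topic (R)
    hits     -- the set of documents actually returned (H)
    """
    relevant_hits = relevant & hits      # RH: relevant documents returned
    irrelevant_hits = hits - relevant    # ID: irrelevant documents returned
    recall = len(relevant_hits) / len(relevant)   # ideal value: 1
    precision = len(relevant_hits) / len(hits)    # ideal value: 1
    fallout = len(irrelevant_hits) / len(hits)    # ideal value: 0
    return recall, precision, fallout

# Three of the four relevant documents appear among five hits:
print(retrieval_metrics({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d3", "d8", "d9"}))
# -> (0.75, 0.6, 0.4)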
5.2. Tools Available for Information Retrieval
Searching for information can be difficult at the best of times. Information
Retrieval has led to a variety of tools being developed to help make the task
easier, some of which are more effective than others.
5.2.1. Binary Matching
Binary Matching, also known as Boolean Matching, is one of the simpler
Information Retrieval techniques. It is based upon searching for keywords and
determining whether they are contained in a document or not.
The simplest format is where the user enters a keyword into a search engine
and a list of documents that contain that specified keyword is returned. Most
search engines demonstrate this “because it is fast and can therefore be used
online” (THOR 1999, p.1).
Binary matching makes use of Boolean queries, allowing the user to specify
connectives, such as AND, OR, and NOT, to improve the search results.
However, binary matching will tend to miss many documents as it only deems
a document to be relevant or irrelevant; there is no partial matching.
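As a minimal sketch of the idea (the query format here is deliberately simplified and is an assumption, not taken from any cited system), a Boolean AND query can be evaluated as a simple set test:

def boolean_and_search(documents: dict, keywords: list) -> set:
    """Return the ids of documents containing ALL of the given keywords.

    documents -- maps each document id to the set of words it contains
    """
    return {doc_id for doc_id, words in documents.items()
            if all(keyword in words for keyword in keywords)}

docs = {
    "d1": {"jaguar", "aircraft", "engine"},
    "d2": {"jaguar", "car", "engine"},
    "d3": {"cat", "jaguar"},
}
print(boolean_and_search(docs, ["jaguar", "engine"]))   # {'d1', 'd2'}

A document either matches the whole query or it does not; there is no notion of a partial or ranked match, which is exactly the limitation the Vector Space Model addresses.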
5.2.2. Vector Space Model
The Vector Space Model is designed to be an improvement over Binary
Matching by relaxing the restrictions caused by the ‘Relevant or Irrelevant’
concept. It, therefore, allows partially relevant matches to be found.
This modelling approach relies on the principle that the document’s meaning
can be determined by examining the terms (words) that make up the content of
the document.
Vector Space Modelling comprises three main stages: document indexing, term weighting and similarity coefficients.
Document Indexing is oriented around the extraction of all of the terms that
describe the content of the document. Many of the words, such as ‘the’ and
‘is’, do not describe the content and so can be excluded from the index list.
Once all of the ‘content-describing’ terms have been found, Term Weighting
is applied. This is done to highlight the more relevant terms.
There are a variety of ways to calculate the weighting for each term, including the frequency of each term, collection frequency and length normalisation. To date, the weighting that provides the best results, according to the quality measures, makes use of “term frequency with inverse document frequency and length normalization” (THOR 1999, p.2).
Figure 4 - Representing Document Space
The results of applying the weighting to the index list are used to create the
Index Matrix, “relating each document in a corpus to all of its keywords”
(Belew, R 2000, p.86). A simplified approach is the construction of a table
with the documents listed across the top and the identified content terms listed
down the side (See Figure 4), with the weighting being equal to the number of
occurrences within the document.
Figure 5 - Simplified Vector Representation of the Index Matrix in Figure 4
This matrix can be represented using vectors (See Figure 5), but these can be difficult to visualise, as they tend to occupy a large number of dimensions rather than the three that most people can easily picture.
Once the weighting has been applied, the similarity coefficient stage follows. This looks at the terms within the query, with the appropriate weighting applied, and compares them to the terms found in the documents. A score is calculated for each document using either the dot product (See Figure 6) or cosine similarity.
Figure 6 - Vector Matching
The calculated score indicates the relevance of the document to the entered query under the chosen weighting approach, and can be used to provide a ranking of the documents.
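As a minimal sketch of this matching stage, using raw term frequencies as the weighting (the simplest scheme mentioned above) and cosine similarity as the similarity coefficient; the data and names are illustrative:

import math

def cosine_similarity(query_vec, doc_vec):
    """Cosine of the angle between the query vector and a document vector."""
    dot = sum(q * d for q, d in zip(query_vec, doc_vec))
    norm_q = math.sqrt(sum(q * q for q in query_vec))
    norm_d = math.sqrt(sum(d * d for d in doc_vec))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

# Index matrix over the terms ["jaguar", "aircraft", "car"]; the values
# are raw term frequencies, one row (vector) per document.
index_matrix = {
    "d1": [3, 2, 0],
    "d2": [2, 0, 4],
}
query = [1, 1, 0]   # the query mentions "jaguar" and "aircraft" once each

scores = {doc: cosine_similarity(query, vec) for doc, vec in index_matrix.items()}
print(sorted(scores.items(), key=lambda item: -item[1]))   # d1 ranks above d2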
5.2.3. Latent Semantic Indexing
There are two main problems with language associated with context and
meaning. One of the problems is where objects can be referenced by multiple
terms. For example, a person giving a speech may be classed as the speaker or
spokesperson. This problem is known as Synonymy. The other problem is
where words can have more than one meaning. For example, the word Jaguar
has several different interpretations that can be applied to it based solely on
that word. It could be a car, a military aircraft, a company or a type of cat.
This type of problem is known as Polysemy.
If these two problems can be reduced, more relevant documents than irrelevant ones will be returned. Latent Semantic Indexing aims to accomplish this, as will be seen in Section 6.
6. Latent Semantic Indexing
Latent Semantic Indexing (LSI) was developed originally by Bellcore (now
Telcordia™ Technologies). It is an extension of Vector Space Modelling, designed to
provide search results related to a specified keyword, even if the keyword is not
contained within the document. It achieves this through the application of Singular Value Decomposition (SVD) to the index matrix as created by the Vector Space Model.
LSI compares documents to determine if they contain many common words.
Documents that follow this trend are then deemed to be semantically close. This
approach is suggested by Yu, C et al (2003) to work surprisingly well, and also
generally mimics “how a human being, looking at content, might classify a document
collection” (Yu, C et al 2003, p.1). For this reason, LSI can return relevant results,
even without the documents containing the specified keyword.
One of the impressions that LSI gives is that it is looking at the context and demonstrating its comprehension of relationships. However, in practice, it does not
do this. It applies pattern matching on word use, and establishes connections based
upon these patterns.
6.1. How does LSI work?
Latent Semantic Indexing is composed of four stages: content search, index matrix composition, term space modelling, and the application of Singular Value Decomposition. All stages are applied to provide a refined set of search results.
6.1.1. Content Search
Before any processing can be undertaken, the content-describing words need to be identified. At first glance, this may appear to be an impossible task.
The main focus of narrowing down this search is based upon the principle that
natural language contains a lot of redundancy. This means that many of the
words can be removed and the remaining words will still describe the content.
Therefore, the first step is to remove these redundant words. There are many approaches that could be used for this, but they generally all achieve the same principal task.
Initially all words within every document in the collection are collated so that
all important words are available for use. All prepositions, conjunctions and
pronouns are removed, along with common verbs and adjectives. Words which aid the readability of the text, such as ‘therefore’, ‘however’ and ‘thus’, are also removed.
This process leaves a vastly reduced list. However, it can be reduced further by removing words that are contained in all documents, as they do not distinguish the documents from each other. Words that are contained in only one document are also removed, as they do not allow related documents to be found. This provides the final list of content-describing words.
6.1.2. Index Matrix Composition
Once all of the content-describing words have been found, the index matrix,
which is also known as the term-document matrix, can be created. This index
matrix is the same as that created by the Vector Space Model (See Section
5.2.2).
As for the Vector Space Model, there are a variety of methods available to
calculate the term weighting. However, Latent Semantic Indexing usually
makes use of a “local and global weighting scheme” (Berry, M 1996, p.1).
The local weighting adjusts the relative importance of the terms within the
documents, whilst the global weighting adjusts the relative importance across
the collection of documents. These weightings are only applied to non-zero
elements within the matrix, and are obtained by multiplying the global and local weightings together.
Mathematically, the resulting value for each element (a) is as follows:
a_td = fL(t, d) * fG(t)
where fL() is the local weighting function, fG() is the global weighting
function and t is the term in document d.
From this it can be seen that fL(t,d) is the function that calculates the relative
importance of the term t within document d, as both the term and the
document need to be supplied. fG(t), however, only requires the term t to be
supplied, and it can be seen that this function calculates the relative
importance of the term across the entire collection.
There are a variety of different local and global weighting functions that have been tried. The most advantageous scheme over the use of raw term frequency was found by Dumais (cited in Berry, M 1996, p.1) to be the log-entropy weighting scheme, providing a 40% advantage.
6.1.2.1. Log-Entropy Weighting
Log-entropy weighting combines a logarithmic local weighting with an entropy-based global weighting: the log of each term’s frequency within a document is scaled by a global weight derived from the entropy of that term’s frequency distribution across the collection.
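As a sketch of one common formulation of this scheme (the exact formulas below follow the standard literature rather than this report, so treat them as an assumption): the local weight is log(1 + tf) and the global weight is one plus the normalised entropy term of the term’s distribution over the n documents.

import numpy as np

def log_entropy_weight(tf):
    """Apply a log-entropy scheme to a term-by-document frequency
    matrix tf (terms in rows, documents in columns, n > 1 documents)."""
    n_docs = tf.shape[1]
    gf = tf.sum(axis=1, keepdims=True)             # global frequency of each term
    p = np.divide(tf, gf, out=np.zeros(tf.shape), where=gf > 0)
    safe_p = np.where(p > 0, p, 1.0)               # log(1) = 0: empty cells add nothing
    # Global weight: 1 for a term concentrated in a single document,
    # falling towards 0 for a term spread evenly over the collection.
    global_w = 1.0 + (p * np.log(safe_p)).sum(axis=1, keepdims=True) / np.log(n_docs)
    local_w = np.log(1.0 + tf)                     # dampened local frequency
    return local_w * global_w                      # a_td = fL(t, d) * fG(t)

tf = np.array([[2, 0, 1],    # a term concentrated in two documents
               [1, 1, 1]])   # a term spread evenly across the collection
print(log_entropy_weight(tf))   # the evenly spread term weights to zero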
6.1.2.2. Other Types of Weighting
There are a variety of different weighting methods, which are classified as
either a local or a global weighting function.
Local weighting functions include term-frequency and binary weighting.
Binary weighting is based upon highlighting whether a term exists within a
document or not. The values of the matrix consist either of a 1 or a 0,
where 1 means that the term is in the document. Term-frequency takes
binary weighting one stage further. Rather than just stating whether a term
is contained within a document, this approach creates a matrix which
depicts how many times the term occurs in a given document.
Global weighting functions include GfIdf, and Inverse Document
Frequency (IDF) weighting.
IDF weighting is oriented around the “observation that people tend to
express their information needs using rather broadly defined, frequently
occurring terms” (Belew, R 2000, p.84). However, it is usually the less
frequently occurring terms that are more important, and so IDF makes use
of this fact to provide an alternative weighting approach. GfIdf is a related variation which divides a term’s global frequency by the number of documents containing it.
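As a minimal sketch, the textbook form of IDF can be computed as follows; the log(N/df) formula is the standard definition and is an assumption here, as the report does not spell one out:

import math

def idf(term: str, documents: list) -> float:
    """Inverse document frequency: rarer terms receive higher weights.

    documents -- a list of word sets, one set per document
    """
    df = sum(1 for words in documents if term in words)   # document frequency
    return math.log(len(documents) / df) if df else 0.0

A term occurring in every document scores log(1) = 0, reflecting its lack of discriminating power, whilst rarely occurring terms score highly.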
6.1.3. Term Space Modelling
Term Space Modelling is oriented around providing a graphical representation of the Index Matrix. The process is the same as that undertaken in Vector Space Modelling (See Section 5.2.2). There are a variety of methods for accomplishing this modelling task, but all entail representing each document in terms of its keywords (terms).
Figure 7 - Simplified Comparison of Term Space Modelling Diagrams
A graph can be created to show each of the documents. Each point on the graph represents a document, whilst each axis represents a unique keyword. As there are usually many unique keywords within a document, such a graph can be difficult to visualise, as humans can only easily picture three dimensions (See Figure 7).
As it is difficult to represent, let alone manipulate, documents in thousands of dimensions, a method is needed to reduce this number to a more computationally manageable amount. This is done through techniques such as Singular Value Decomposition.
6.1.4. Singular Value Decomposition
Singular Value Decomposition (SVD) is an orthogonal decomposition and is
used to reduce the number of dimensions used to represent the documents. It
has been found that eigenfactor analysis is an efficient way to “characterize
the correlation structure among large sets of objects” (Belew 2000, p.156), and
SVD is just one of the techniques used to accomplish this.
Figure 8 - Pictorial Representation of SVD
Singular Value Decomposition splits any rectangular matrix of size m×n into three components: an m×n matrix (U), an n×n matrix (V^T) and an n×n diagonal matrix (S), which describes the relationship between the other two (See Figure 8). This is done using the formula
X = U S V^T
The m×n matrix (U) is composed of columns called left singular vectors, denoted {u_k}. The rows of the n×n matrix (V^T) “contain the elements of the right singular vectors, {v_k}” (Wall, M et al 2003, p.1). The diagonal matrix (S) contains the singular values: these diagonal elements are non-zero, whilst all other elements within the matrix are zero. This implies that

S = diag(s_1, ..., s_n)
The singular values are ordered from high to low, meaning that the highest singular value is in the upper-left entry of the diagonal matrix.
Singular Value Decomposition for an m×m symmetric matrix (X) is the same as the result of solving the eigenvalue problem, or diagonalization. SVD allows the calculation of

X(l) = Σ_{k=1}^{l} u_k s_k v_k^T
The application of SVD on the symmetrical matrix (X) provides an important
comparison. It is found that X(l) calculated by the above formula is “the
closest rank-l matrix to X. The term “closest” means that X(l) minimises the
sum of the difference of the elements of X and X(l)” (Wall, M et al 2003, p.1).
It is possible to diagonalize XTX to calculate VT and S according to the
formula
X^T X = V S^2 V^T
U can then be calculated by
U = X V S^-1
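These identities are easy to verify numerically. Below is a minimal sketch using NumPy’s built-in SVD routine on an arbitrary 4×3 matrix; the data is purely illustrative:

import numpy as np

X = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0],
              [2.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])   # an arbitrary m x n (4 x 3) matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # thin SVD: U is m x n
S = np.diag(s)                                     # singular values, high to low

assert np.allclose(X, U @ S @ Vt)                    # X = U S V^T
assert np.allclose(X.T @ X, Vt.T @ S**2 @ Vt)        # X^T X = V S^2 V^T
assert np.allclose(U, X @ Vt.T @ np.diag(1.0 / s))   # U = X V S^-1

# The closest rank-2 matrix to X keeps only the two largest singular values.
X2 = U[:, :2] @ S[:2, :2] @ Vt[:2, :]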
There are a variety of methods that have been implemented by the University
of California and IBM research to calculate the SVD of a matrix. These
methods include:
Householder Reflections and Givens Rotations – This approach can be used if the matrix is small and not very sparse after various processing techniques. It tends to be impractical for Internet-based Information Retrieval, due to its slow speed.
Power Method and Subspace Iteration – The Power Method is used to find the “largest eigenvalue and corresponding eigenvector for a square matrix” (King, O 1999, p.24). Subspace Iteration is based upon the Power Method.
Lanczos Algorithm – This method is used to calculate the singular values of large matrices, but it does require additional computations to find the associated eigenvectors. There are a variety of variations, including Full Reorthogonalization (FRO), Selective Orthogonalization (SO), Scott’s Orthogonalization (SCO), Selective Orthogonalization II (SO2) and Partial Orthogonalization.
All of these approaches have their advantages and disadvantages, with some
being more computationally viable than others. These approaches have been
experimentally evaluated and improvements have been found by combining
the approaches (See King, O 1999, p.24).
Within Latent Semantic Indexing, SVD is applied to the index matrix. Once
this has been done, the searching can begin. However, it does require a
readily available copy of the matrix U to be kept so that the search queries can
be transformed to make use of the same document dimensions.
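A minimal sketch of this query transformation, assuming a rank-k decomposition has already been computed; the fold-in formula q_k = S_k^-1 U_k^T q is the standard LSI query projection, and the toy data is illustrative:

import numpy as np

# Toy index matrix: 4 terms by 3 documents, reduced to k = 2 dimensions.
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0],
              [2.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T   # one row of V_k per document

def project_query(q, U_k, s_k):
    """Fold a raw keyword query vector into the reduced space:
    q_k = S_k^-1 U_k^T q. Only U_k and s_k need to be kept available."""
    return (U_k.T @ q) / s_k

q = np.array([1.0, 0.0, 0.0, 0.0])   # a query using only the first term
q_k = project_query(q, U_k, s_k)

# Rank the documents by cosine similarity in the reduced space.
sims = (V_k @ q_k) / (np.linalg.norm(V_k, axis=1) * np.linalg.norm(q_k))
print(np.argsort(-sims))   # document indices, best match first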
6.2. Strengths of LSI
Latent Semantic Indexing provides the ability to find a broader range of relevant documents because it looks at semantic relationships between groups of keywords and uses a high-dimensional representation. The LSI algorithm also represents both terms and text objects in the same space, allowing all relevant information to be processed together as a collection rather than as unrelated documents. This representational feature also helps in allowing objects to be retrieved directly from the query terms.
6.3. Problems with LSI
As with most techniques in Information Retrieval, there are advantages and disadvantages. Although LSI provides a significant improvement in the number of returned documents that are relevant, there are a few main disadvantages that need to be considered.
As LSI makes use of Singular Value Decomposition, every time documents are
added or removed the SVD statistics of the entire collection are affected. How
much the statistics are affected depends on the size of the document collection.
The larger the collection, the less the statistics are affected.
Because the document corpus is compressed, queries must be “transformed from the original space of ‘raw’ keywords into the reduced k-dimensional space” (Belew, R 2000, p.159). This has to be repeated for every single query that is used. It also means that the matrix (U) used in the transformation must be kept readily available.
These disadvantages all contribute to the need for extra storage space and extra
computational power. This means that LSI tends to be slower than the
conventional binary matching methods used in search engines.
6.4. The Uses and Potential Uses of LSI
Latent Semantic Indexing is generally thought of as being an improvement over
the current binary matching search engines. However, there are a variety of other
implementations.
6.4.1. Relevance Feedback
The standard search engine only requires users to enter a small number of
keywords to create a query. The larger this supplied list of keywords, the
more irrelevant documents are returned. LSI contrasts this approach by
making use of a larger set of related keywords to improve the recall.
Relevance feedback improves the query supplied by the user by making use of
the terms within relevant documents. This is achieved without increasing the
computational requirements needed to perform the query, and allows the recall
and precision of the results returned to be improved.
6.4.2. Information Filtering
As Latent Semantic Indexing is able to correlate related keywords, one
possible use is to use it for information filtering, where certain types of words
are removed or documents/text containing certain words are removed from
retrieved documents
An application of information filtering is the filtering of spam email. A major problem with the use of email is the increasing amount of junk mail that users have to wade through before actually getting to the important messages. With an appropriate implementation of LSI, information filtering could be used to remove all documents that follow a generic structure. Initially, LSI would be poor at achieving this until its collection of documents became substantial enough to infer relationships between keywords in unwanted email.
LSI would then be able to isolate these email messages, or even rank the user’s email messages in order of importance, putting all of the junk email at the bottom of the list.
Information filtering does not need to be limited to the use of helping to
manage spam. It could be used to manage content expressed within chat
rooms, news groups, bulletin boards and family-suitable search engine results,
using a similar system.
6.4.3. Textual Coherence
When documents are written, one of the main tasks is to ensure that it flows
correctly from one topic to another. LSI could be used to determine the
semantic relationship between parts of the document, allowing a picture to be
developed of how the document flows from one topic to another.
6.4.4. Automated Writing Assessment
With a large collection of documents, a student’s report could be analysed to
determine whether any areas of research were missed. This allows the student
to be given feedback and be provided with areas of further research.
LSI could also aid in the task of fully automating academic integrity checking.
With a broad range of documents within its collection, a system could check a
submitted document to determine if any of the content is copied directly from
other sources.
One further area that could be explored is automating the marking of exams and coursework. LSI could be extended to determine whether the correct content has been covered. If further extended to include textual coherence and meaning, the document could also be assessed to determine whether it makes sense and is readable.
6.4.5. Cross-Language Retrieval
Latent Semantic Indexing is able to find a wide range of resources related to a given keyword. By translating the keywords, relevant documents in other languages can also be found. The system could then be used to display all documents in any language that the user can read and understand.
7. Conclusions
Latent Semantic Indexing is a mathematically-based solution to finding documents on
the Internet. It relies upon finding relationships between keywords and inferring a
semantic correlation between them. This approach gives the inaccurate impression that the system understands the meaning of the documents, allowing it to find ones that are related.
LSI makes use of a complex mathematical technique known as Singular Value
Decomposition to reduce the number of dimensions within the document space. Due
to this and other computational requirements, LSI tends to be slower than binary matching techniques such as those used in search engines. However, this needs to be weighed against the ultimate advantage of an improved success at returning relevant
documents, even if they do not contain the specified keyword. LSI “has been shown
to be 30% more effective in finding and ranking relevant items than the comparable
word matching methods” (Telcordia Technologies Ltd. Undated, p.1).
There is a lot of potential for developing LSI further by incorporating true meaning analysis. However, if meaning analysis is ever accomplished, it is likely that LSI will be made redundant, as more accurate techniques will probably be developed.
8. References
BELEW, R (2000) Finding Out About: A Cognitive Perspective on Search Engine Technology
and the WWW Cambridge: Cambridge University Press
This book introduces Information Retrieval, and a variety of tools and approaches associated
with the field. It looks primarily at text retrieval, but aims to do so in an easily to understand
approach.
BERRY, M (1996) 2.2.2 Weighting [WWW] http://www.cs.utk.edu/~berry/lsi++/node7.html
[accessed 11/11/03]
This site provides a brief description of weighting within LSI.
KING, O (1999) Information Retrieval and Ranking on the Web: Benchmarking Studies II
[WWW] http://citeseer.nj.nec.com/cache/papers/cs/12049/
http:zSzzSzwww.trl.ibm.co.jpzSzprojectszSzs7710zSzdlzSztrlrepzSzrt298.pdf/informationretrieval-and-ranking.pdf [accessed 18/11/03]
This document experimentally investigates a variety of SVD approaches in order to provide a
general comparison.
KOLDA, T & O’LEARY, D (1998) A Semidiscrete Matrix Decomposition for Latent Semantic
Indexing in Information Retrieval [WWW] http://web2.infotraccustom.com/pdfserve/get_item/1/S7b0bf9w4_3/SB993_03.pdf [accessed 14/10/03]
This document describes LSI and how it fits into Information Retrieval by providing
comparisons to some of the alternatives. It also describes some of the variations of LSI.
TELCORDIA TECHNOLOGIES INC (undated) Telcordia™ Latent Semantic Indexing
Software (LSI): Beyond Keyword Retrieval [WWW]
http://lsi.argreenhouse.com/lsi/papers/execsum.html [accessed 10/11/03]
This is an executive summary about LSI by Telcordia, the company formerly known as Bellcore.
THE EVERYTHING DEVELOPMENT COMPANY (2000) Information
Retrieval@Everything2.com [WWW] http://www.everything2.com/index.pl [accessed
14/10/03]
This page provides a summary of the query ‘Information Retrieval’ providing a good starting
point for further investigation into the topic.
THOR (1999) Introduction: Boolean Retrieval [WWW]
http://isp.imm.dtu.dk/thor/projects/multimedia/textmining/node2.html [accessed 29/10/03]
This page is one of a collection that briefly introduces Information Retrieval. This particular page provides a brief description of Boolean Retrieval / Boolean Matching.
WALL, M; RECHTSTEINER, A & ROCHA, L (2003) Singular value decomposition and
principal component analysis [WWW] http://public.lanl.gov/mewall/kluwer2002.html [accessed
12/11/03]
This website describes Singular Value Decomposition from a mathematical viewpoint, but
helps with the understanding by attempting to simplify the topic as much as possible.
YU, C; CAUDRADO, J; CEGLOWSKI, M & PAYNE, J. S (2003) Latent Semantic Indexing
[WWW] http://javelina.cet.middlebury.edu/lsa/out/lsa_definition.htm [accessed 21/10/03]
This is just one page out of a collection, which introduces the concept of improving search
engines and how LSI and Multi-Dimensional Scaling can help with this. This particular page
introduces LSI and provides an overview. Pages associated with this document explain how
LSI works and the uses/potential uses for LSI.
9. Bibliography
BERRY, M (1996) 2.2 Latent Semantic Indexing [WWW]
http://www.cs.utk.edu/~berry/lsi++/node5.html [accessed 11/11/03]
BERRY, M (1996) 2.2.1 Term-Document Representation [WWW]
http://www.cs.utk.edu/~berry/lsi++/node6.html [accessed 11/11/03]
BERRY, M (1996) 2.2.3 Computing the SVD [WWW]
http://www.cs.utk.edu/~berry/lsi++/node8.html [accessed 11/11/03]
BERRY, M (1996) 2.2.4 Query Projection and Matching [WWW]
http://www.cs.utk.edu/~berry/lsi++/node9.html [accessed 11/11/03]
BERRY, M (1996) 2.2.5 Relevance Feedback [WWW]
http://www.cs.utk.edu/~berry/lsi++/node10.html [accessed 11/11/03]
CARNELL, T (2000) Investigation into Internet search technology [WWW]
http://www.scism.sbu.ac.uk/inmandw/tutorials/irtutorials/O1.DOC [accessed 29/10/03]
ELKS, A (2000) Information Retrieval Techniques [WWW]
http://www.scism.sbu.ac.uk/inmandw/tutorials/irtutorials/O2.DOC [accessed 29/10/03]
FOLTZ, P & DUMAIS, S (1992) Personalized information delivery: an analysis of information
filtering methods [WWW] http://web2.infotraccustom.com/pdfserve/get_item/1/S7b0bf9w4_4/SB993_04.pdf [accessed 14/10/03]
FOLTZ, P (1990) Using Latent Semantic Indexing for Information Filtering [WWW]
http://www-psych.nmsu.edu/~pfoltz/cois/filtering-cois.html [accessed 14/10/03]
HAMMERSLEY, B (2003) Guardian Unlimited | Online | Time for new search techniques
[WWW] http://www.guardian.co.uk/online/story/0,3605,889299,00.html [accessed 21/10/03]
INMAN, D (2003) Introduction to Information Retrieval London: London South Bank University
JENNINGS, S (2003) Internet Cryptography Lecture Notes - Week 1 London: London South
Bank University
KOLDA, T & O'LEARY, D (2000) Algorithm 805: Computation and Uses of the Semidiscrete
Matrix Decomposition [WWW] http://web2.infotraccustom.com/pdfserve/get_item/1/S7b0bf9w4_5/SB993_05.pdf [accessed 14/10/03]
LETSCHE, T AND BERRY, M (1996) Large-Scale Information Retrieval with Latent Semantic
Indexing [WWW] http://www.cs.utk.edu/~berry/lsi++/ [accessed 29/10/03]
MERIS, G (2000) Latent Semantic Indexing [WWW]
http://www.scism.sbu.ac.uk/inmandw/tutorials/irtutorials/I1.DOC [accessed 29/10/03]
MURTAGH, F (2001) Information Retrieval [WWW]
http://www.cs.qub.ac.uk/~F.Murtagh/csc306/sxb-ho.pdf [accessed 18/11/03]
PERRY LIBRARY (2001) Referencing Electronic Sources London: London South Bank
University
PERRY LIBRARY (2001) Referencing using the Harvard System: frequently asked questions
London: London South Bank University
THOR (1999) Introduction: Latent Semantic Indexing [WWW]
http://isp.imm.dtu.dk/thor/projects/multimedia/textmining/node10.html [accessed 29/10/03]
THOR (1999) Introduction: Vector Space Model [WWW]
http://isp.imm.dtu.dk/thor/projects/multimedia/textmining/node5.html [accessed 29/10/03]
YU, C; CAUDRADO, J; CEGLOWSKI, M & PAYNE, J. S (2003) Applications of LSI [WWW]
http://javelina.cet.middlebury.edu/lsa/out/lsa_applications.htm [accessed 21/10/03]
YU, C; CAUDRADO, J; CEGLOWSKI, M & PAYNE, J. S (2003) How LSI Works [WWW]
http://javelina.cet.middlebury.edu/lsa/out/lsa_explanation.htm [accessed 21/10/03]
YU, C; CAUDRADO, J; CEGLOWSKI, M & PAYNE, J. S (2003) LSI Example - Indexing a
Document [WWW] http://javelina.cet.middlebury.edu/lsa/out/tutorial.htm [accessed 21/10/03]
YU, C; CAUDRADO, J; CEGLOWSKI, M & PAYNE, J. S (2003) The Term-Document Matrix
[WWW] http://javelina.cet.middlebury.edu/lsa/out/tdm.htm [accessed 21/10/03]