The Vector Space Model - An Information Retrieval Method

Terms of Reference

Terms of Reference
Abstract
The Vector Space Model - An Information Retrieval Method
How the search engines work
Search Technologies
Interactive Search/Feedback
Expressing documents in terms of index terms
Measuring query correlations
Ranking results
Ideal vector space
Classification with centroids being the super class
Term mismatch
Spelling mistakes
References
Robert Ntege
The vector space Model
Page 1 of 22
THE VECTOR SPACE MODEL - AN INFORMATION RETRIEVAL METHOD
Abstract:
This report describes the vector space model (VSM) method as used in information retrieval systems. The author attempts to critically analyse the strengths of the vector space model while suggesting ways to improve its shortcomings. The computational vector space is at best misused or, to put it another way, underused. The vector space model is fundamentally prone to underachieve in the information retrieval domain wherever ambiguities exist within the indexed documents, and ambiguity is a natural phenomenon of natural language. Both the model's strength and its weakness lie in the fact that the entire document collection is stripped down into suitable index terms, and it is at this point that the problem takes shape. We assume that the terms are spread through the vector space evenly enough to preserve the overall picture of the collected terms, without explicitly saying what the terms really are.
The terms used to describe a vector space must be carefully examined so that their dependencies within a document can still be clearly documented. The author attempts to investigate the different algorithms used to study the vector space. The report therefore places somewhat more emphasis on the terms used to form the basis vectors than on the actual mechanics of the vectors themselves. Perhaps the most important point of this report is that the nature and performance of the vector space model is still shaped by the user, and the vector space itself takes its shape from the different information stored in the collection. Studying the vector space and the shape it might take is therefore the key to improving its performance in response to a user's query.
Discussion.
In information retrieval systems holding millions of documents that represent information on almost every aspect of humankind, retrieving individual elements from the collection is no easy feat. It is therefore thought best to model the information as small units that can be placed in a space in such a way that their relative positions are known, or can be computed, given the different relationships within that space.
The VSM attempts to draw a map of documents from their "pictorial content" when it uses terms to represent documents. This seems a plausible way of modelling information: after all, we take pictures of areas and map them out in a specific manner that allows us to get from place A to place B. Or consider a patient with breathing-related problems. If there are specialist hospitals for each condition, the patient needs a map of some kind to locate, among these many hospitals, the specific one for his or her problem.
Unfortunately, as mentioned before, the information pages are vast and unstructured. The only commonality across documents that can get us started is the words, or terms, within them. The pictures we can paint involve stripping documents down to their atomic content, that is, their special terms. It then follows that if we know the makeup of documents, we can ask whether documents can be modelled in a way that fully represents them over a given space. Other questions arise:
Questions:
1. Would it be best to construct a vector space in such a way that each term is as far away from the next as possible?
2. Is there an actual mechanism for modelling documents that achieves maximum retrieval values for a given user query? That is to say, how can we locate each item within a collection of elements that all share the same makeup, without actually knowing or understanding what each element stands for?
3. Is it the case that natural language processing involves relationships that we do not necessarily use concisely? There is no explicit representation of the vectors, so the semantics are thrown out of the window, leaving us grasping at how far term co-occurrence can represent semantics.
Introduction
Information storage systems are constructed so that the information stored is easily accessible. This is easier said than done, however: it is generally accepted that once the documents are all mixed up, their retrieval can become as random as determining the brightest star. In the Internet environment, where millions of different information documents are stored, the task is greater still. A good structure or representation of the information is one piece of the puzzle. Constructing logical structures may, to a certain extent, be achieved, but matching a user's information request to a given document within even a perfect index is another problem altogether. It is therefore thought best to construct an index of terms while keeping the users in perspective. This matters all the more because the vector space model does not represent the terms explicitly.
The vector space model essentially represents documents and terms as vectors. This is where we lose some degree of certainty about where a document may lie in relation to the next in the VSM. It is not far-fetched to assume that each index term is a mapping within a document onto the space; this statement is an attempt to place the vector space within a plane. It is therefore conceivable that we shall have a clustered space where documents that represent the same terms or concepts lie in close proximity. This idea naturally leads us to conceive of document centroids, where the density of documents around a point indicates similar items. An alternative measure would be to distribute the space so that terms are as far apart as possible. This assumption lends itself to new contradictions. For instance, imagine the information storage and retrieval system of a bank in which, for the purposes of this example, strictly money transactions are stored. If a user constructed a query vector into this space asking about, say, pork bellies, we can clearly see the system being 100 percent precise, as this vector will lie well away from other pairings in the vector space. It is on this assumption that we proceed to examine exhaustively the vector space model as first proposed by Salton (1989) [1].
It is worth mentioning that the vector space model suffers severely from the "term mismatch" problem. As we shall discuss later, the VSM takes on terms that are a finite representation of all the possible basic terms within the space. However, as is often the case, one term can be directly interchangeable with another, and this is where the VSM lags and needs some mechanism to accommodate synonyms. It would be ideal to break a query down into simple parts and resolve any concept differences that may be present. Since term frequency is one of the bases for constructing a vector model, it is essential that terms are well represented.
The vector space model makes a few fundamental assumptions before attempting to map out the domain. Firstly, there is the assumption that some words are more important than others: word frequency is thought to matter, since the more often a term appears in a document, the more significance it is taken to carry. Secondly, Luhn's analysis of content and function words is put into practice by applying a stop list to remove all common words. Essentially the stop list includes pronouns, articles, connectors and the like, which are stripped from the document. The remaining terms in the document form the basis for the desired vector space to be modelled.
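The preprocessing described above can be sketched in a few lines: strip the stop words, then count what remains. The stop list used here is a tiny illustrative sample, not any standard list.

```python
from collections import Counter

# A small sample stop list (pronouns, articles, connectors); real stop
# lists are considerably larger.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "is", "it", "he", "she"}

def index_terms(document: str) -> Counter:
    """Lower-case, tokenise on whitespace, drop stop words, count terms."""
    tokens = document.lower().split()
    return Counter(t for t in tokens if t not in STOP_WORDS)

counts = index_terms("The boy and the stray dog")
# "the" and "and" are stripped; "boy", "stray" and "dog" remain.
```

The surviving term counts form the raw material from which the document vectors discussed below are built.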
Brief Introduction to Information Retrieval and its problems
In the past few years the Internet has grown to an unprecedented scale, and with it the information stored has increased to the point that its archival and retrieval are now a major concern. Perhaps the greatest concern is the structure of the Internet, which leaves only one practical way of collecting information: manual information gathering is not an alternative, so automated indexing of information is the default. Automated indexing in present IR systems has a few problems, which present themselves in terms of strategic positioning, the nature of the information stored and, finally, the kind of users expected to use the IR systems. A good information retrieval system should be able to take a user query and return as many relevant documents as possible while rejecting the non-relevant items. This leads to the two common measures of how good an IR system is: it should have good precision and high recall.
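The two measures can be computed directly from the set of retrieved documents and the set of relevant ones; the document identifiers below are invented purely for illustration.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Precision = fraction of retrieved items that are relevant;
    recall = fraction of relevant items that were retrieved."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 3 of the 4 retrieved documents are relevant, and 3 of the 6 relevant
# documents were found.
p, r = precision_recall({"d1", "d2", "d3", "d4"},
                        {"d1", "d2", "d3", "d5", "d6", "d7"})
# p == 0.75, r == 0.5
```

Note the tension the report alludes to: retrieving more documents tends to raise recall while lowering precision, and vice versa.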
How the search engines work
Putting private information retrieval systems aside and considering the Internet, there are at least three major parts to a search engine. First there is the spider, also known as the crawler, which searches the different servers for new or updated web pages. The information gathered is stored in an index. How the index is formulated varies from company to company, as do the methods for retrieval. Most search engines have an automated index: the crawler updates the index with the information it gathers, usually in the form of keywords related to specific documents.
The second part of the search engine is the index. As mentioned earlier, this contains keywords that are related to specific documents on the web. Usually a user's query is directed towards the index, where the comparison takes place. Typically the index is populated in a hierarchical fashion, distinguishing between categories and sub-categories. With automated search engines the categories can be either specified in the meta tags or picked out of the title pages. However, where a page is manually submitted, as with Yahoo, a search engine editor determines the category.
The third part of the search engine is probably the most important of all: the search software that does the actual sifting of pages to determine which are relevant to the user query. The search software determines what the index will look like and how it will be populated. The software also ranks the results, which are displayed on the user's screen in a suitable format. Visualisation techniques also come into play here, formatting and presenting information in a manner that is appropriate and "attractive".
Search Technologies
The fundamental purpose of an IR system is to provide users with an interface through which they can retrieve the information stored. It is therefore the business of this report to look at the various ways of retrieving information from information storage systems. In this report the vector space model is studied extensively; however,
it is worth noting that there are a few other search technologies available, among which are the following.
Boolean Search:
A Boolean search is based on the use of keywords. Essentially the query is treated as the equivalent of 1, so an index term either matches (1) or does not (0). If the index term is true to the query, the document is returned as a search result; upon retrieval, other criteria are used to rank the results according to how relevant they are to the query. The usual logical connectors AND, OR and NOT are used to combine query terms. Since query terms are compared letter for letter, a spelling mistake would have to be rectified by the search engine interface; otherwise there would be no results to return unless an index term with the same spelling mistake existed in the index. It is for this reason that a thesaurus or dictionary can be used either to expand the query or to check it for errors or other possible meanings. This added feature makes the Boolean technique somewhat more effective, but it remains a very limited way of searching for documents, since we are well aware that a concept can be expressed using many terms. Feldman (1999) [5] gives the classic example of the terms football and soccer as used in American dictionaries. The two words are interchangeable; however, if a query asking for football articles were passed to a Boolean search engine, it would unfortunately miss the articles that used the word soccer. Rijsbergen [7] rightly calls this method simple matching, as the function is only looking for commonality between document and query: "In fact, simple matching may be viewed as using a primitive matching function. For each document D we calculate |D ∩ Q|, that is the size of the overlap between D and Q, each represented as a set of keywords", where Q is the set of keywords in the query.
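Boolean AND over an inverted index reduces to set intersection, which also makes the football/soccer mismatch easy to demonstrate. The inverted index below is hypothetical, invented for this sketch.

```python
# Simple Boolean matching: each posting is a set of document ids, and
# AND is expressed as set intersection.
def boolean_and(index: dict, *terms: str) -> set:
    """Documents containing every query term."""
    postings = [index.get(t, set()) for t in terms]
    result = postings[0].copy()
    for p in postings[1:]:
        result &= p
    return result

# Hypothetical inverted index: term -> set of document ids.
index = {
    "football": {"d1", "d2"},
    "soccer": {"d3"},
    "articles": {"d1", "d3"},
}

# The term-mismatch problem in action: a query for "football articles"
# misses d3, even though d3 (indexed under "soccer") covers the concept.
hits = boolean_and(index, "football", "articles")
```

OR and NOT follow the same pattern with set union and set difference respectively.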
Interactive Search/Feedback
It is typical of users not to be sure of exactly what they are searching for. This can be eased considerably if there is interaction with the search engine. Essentially the user types an initial query, from which the search engine makes a first assessment of the user's information needs. If, for example, the user realises that the search term he or she is using is a frequent one, then it might be wiser to narrow the search or be more specific. If the initial results contain too many irrelevant hits that nevertheless contain the search term, then it would be prudent to consult a dictionary for a term that represents the information need more specifically. In short, the user can change the search criteria, trying and modifying until the query is refined to a point where precision and recall are maximised. The one drawback with this approach is that users rarely want a system where the burden lies with them rather than with the search engine. Trial and error would not be very successful in non-academic environments, where people are sometimes not even sure of what they are really looking for. However, if an interactive system were made user-friendly, with good graphical presentation, the user might find it less tasking. Some time is spent, but this buys precision if the user finally gets exactly what he or she wants out of the system. Other technologies exist: cluster-based retrieval, cluster representatives, serial search, matching functions, and the VSM that we shall shortly discuss.
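One way the refinement loop above can be automated is query reweighting: the reformulated query moves toward terms from documents the user marked relevant and away from non-relevant ones. This is a Rocchio-style update, a standard technique the report does not name explicitly, and the weights alpha/beta/gamma below are illustrative assumptions.

```python
def refine_query(query, relevant_docs, nonrelevant_docs,
                 alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio-style feedback over term-weight dictionaries."""
    terms = set(query)
    for d in relevant_docs + nonrelevant_docs:
        terms |= set(d)
    new_q = {}
    for t in terms:
        pos = sum(d.get(t, 0) for d in relevant_docs)
        neg = sum(d.get(t, 0) for d in nonrelevant_docs)
        w = (alpha * query.get(t, 0)
             + beta * pos / max(len(relevant_docs), 1)
             - gamma * neg / max(len(nonrelevant_docs), 1))
        new_q[t] = max(w, 0.0)  # negative weights are clipped to zero
    return new_q

# An ambiguous query: the relevant document is about cars, the
# non-relevant one about the animal.
q = refine_query({"jaguar": 1.0},
                 relevant_docs=[{"jaguar": 1, "car": 2}],
                 nonrelevant_docs=[{"jaguar": 1, "cat": 2}])
# "car" gains weight; "cat" is pushed down to zero.
```

After one iteration the query already points toward the user's intended sense of the ambiguous term.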
Basic concepts of the Vector Space Model
The first step in building a vector space is determining which terms within the documents will best represent the content of each document in turn. Once a set of index terms that can exhaustively represent every document in the collection has been established, the indexing can be done by matching each document to its corresponding set of index terms. In the VSM, both documents and index terms are represented as basic vectors in a linear space, so the space is determined both by the terms used to index and by the documents represented. The occurrence of a term in a document represents part of the document along that term's vector, and the total set of elements concerning the documents to be retrieved is thus modelled as a vector space. Queries are also represented as basic vectors, which are in turn measured relative to the vector space. At this point it is worth noting a few properties of a vector space. These properties lend themselves to accommodating new concepts that we perceive in the vector space, since such concepts can be represented as simple vectors within it. It can be argued that it is this area of the vector space that most needs closer investigation to determine the efficiency of using the VSM. Wong and Raghavan (1984) [2] propose some important properties of the vector space that in turn help expand a basic vector into a new one.
Properties:
a) Any two elements of the system can be added to obtain a new element of the system.
b) Any element can be multiplied by a real number.
c) The vectors obey several basic algebraic rules.
Expressing documents in terms of index terms
The basic assumption is that all vectors in the space are of unit length and that there is no explicit representation of terms. It follows that vectors do not have to start at zero within the coordinate system of the space; rather, the relative distance between vectors is measured and preserved. The second implication is that the projections of each vector pair are considered against the space. It is then conceivable that each document is a point in the space, represented by the area where the document vector touches the space; similar documents will hence lie close together within the space.
As we shall see later, the index terms are the generating set of the space we desire to model. The space is therefore a finite space with n vectors mapped out distributively.
As an example, let us take a collection of three documents represented by three index terms, and consider the following diagram representing its vector space.
Diagram (1)
Let each document be represented by a document vector d^.
Let the query be represented by a query vector q^.
Let the index terms T1, T2, T3, …, Tn be represented by term vectors t1^, t2^, t3^, …, tn^.
Let document A be "a boy and a stray dog" and document B be "a boy". The vector space generated by documents A and B over the terms (a, boy, and, stray, dog) can be represented as below.

Document A:

  Term:       a    boy    and    stray    dog
  Frequency:  2    1      1      1        1

The vector of A is (2, 1, 1, 1, 1).
The vector of B is (1, 1, 0, 0, 0). The plotting of vectors in a given space is
confined to the maximum number of terms the space has. It’s this boundary that creates
the measure of how the terms spread across the documents. The relationship of document A to B, viewed as an image, is the simple fact that document B covers two-fifths of document A's terms.
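The worked example above can be reproduced mechanically: fix the term list, then count each term's occurrences in each document.

```python
# The fixed vocabulary from the example: the terms generated by
# documents A and B.
TERMS = ["a", "boy", "and", "stray", "dog"]

def doc_vector(text: str) -> list:
    """Term-frequency vector over the fixed term list."""
    tokens = text.lower().split()
    return [tokens.count(t) for t in TERMS]

vec_a = doc_vector("a boy and a stray dog")  # (2, 1, 1, 1, 1)
vec_b = doc_vector("a boy")                  # (1, 1, 0, 0, 0)
# B contains 2 of A's 5 distinct terms: the "two-fifths" relationship.
```

Any new document over the same vocabulary maps to a point in the same five-dimensional space.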
Term weighting is useful because the more a term appears in a certain document, the stronger the association between term and document. The inverse also holds: the infrequency of a term in other documents emphasises its importance in the document where it does appear. When the terms are normalised, documents can be scoped such that bigger documents do not score higher than necessary.
For weighting purposes each term can be assigned a 0 or a 1, with further weights assigned according to term occurrence. This is generally good for establishing the importance (a frequency measure) of terms within a document. In the example above, the basic vector can be realised as follows: since we have five terms, the vector is denoted by five positions, each initially 0. Whenever a term appears, its position is set to 1, and the weight corresponding to its term occurrence can also be assigned.
0 0 0 0 0
1 0 0 1 0
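The two weighting ideas in the text can be sketched side by side: a binary occurrence vector, and a frequency-weighted variant in which rarity across the collection boosts a term (an idf-style factor, assumed here; the report does not spell out the exact weighting formula).

```python
import math

def binary_vector(doc_terms, vocabulary):
    """1 if the term appears in the document, else 0."""
    return [1 if t in doc_terms else 0 for t in vocabulary]

def tf_idf_vector(doc_counts, vocabulary, collection):
    """Term frequency scaled by log(N / document frequency)."""
    n = len(collection)
    vec = []
    for t in vocabulary:
        df = sum(1 for d in collection if t in d)
        idf = math.log(n / df) if df else 0.0
        vec.append(doc_counts.get(t, 0) * idf)
    return vec

vocab = ["a", "boy", "and", "stray", "dog"]
# The "1 0 0 1 0" pattern from the text: terms 1 and 4 are present.
bits = binary_vector({"a", "stray"}, vocab)

# A term appearing in every document ("a") gets idf 0; a term unique
# to one document ("boy") keeps its weight.
collection = [{"a": 2, "boy": 1}, {"a": 1, "dog": 1}]
weights = tf_idf_vector({"a": 2, "boy": 1}, vocab, collection)
```

The idf factor captures the report's observation that infrequency across other documents emphasises a term's importance where it does appear.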
Let each term be represented by a term vector ti^. From the model represented in diagram 1, the basis of the space is the set of term vectors t1^, t2^, …, tt^; hence we can conclusively say that the entire space is just a combination of these term vectors. The documents are represented as linear combinations of several term vectors. We can note here that the vector space is a finite collection of t basis vectors; therefore each document d is explicitly expressed by a t-dimensional vector.
Measuring query correlations
To measure the correlation between a document and a query we can use either the inner product of the vectors or a function of the angle between the corresponding vector pair.
The user query essentially takes the shape q^ = (q1, q2, …, qt), a vector in the same t-dimensional space as the documents.
The scalar product d^ · q^ = Σ wi qi is used to measure the correlation between d^ and q^. The relationships between the generated coefficients can then be used to rank the retrieved documents. In essence we are taking the relationship between the query terms and the document terms, measured against each basic term vector. The
scalar product is a relationship born out of the projection of vectors onto each other. This is important because it allows n vector dimensions to take shape, since each vector can be expressed implicitly as a function of the vectors adjacent to it. Diagram 2 shows the projection of query vector q onto document vector d. From the above notation it is reasonable to assume that each document can be expressed as a small area within the document space, represented in terms of the terms it contains. The relationships between the different terms will therefore determine how the user's query can align itself in the space.
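Both correlation measures named above are short computations: the inner (scalar) product of the document and query vectors, and the cosine of the angle between them, which is the normalised form.

```python
import math

def dot(d, q):
    """Inner product of two equal-length vectors."""
    return sum(di * qi for di, qi in zip(d, q))

def cosine(d, q):
    """Cosine of the angle between d and q (normalised scalar product)."""
    norm = math.sqrt(dot(d, d)) * math.sqrt(dot(q, q))
    return dot(d, q) / norm if norm else 0.0

d = [2, 1, 1, 1, 1]   # document "a boy and a stray dog" from the example
q = [1, 0, 0, 0, 1]   # query "a dog" over the same five terms
# dot(d, q) == 3; cosine(d, q) == 3 / (sqrt(8) * sqrt(2)) == 0.75
```

The cosine form removes the document-length bias that the raw scalar product would otherwise introduce.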
The projection of one term onto another measures direction along that term's projection; the magnitude is represented by the distance separating the two terms over a specified space. In the case of the VSM for information retrieval this works fine as long as we assume that the terms are totally independent of each other while still exhibiting some relationships when considered as a set. This is what makes the vector space model expandable, and it is on this basis that we create a set of terms sufficient to represent documents in vector terms.
Diagram 2
The vector space has a dimension of two terms.
The important thing to note in the above diagram is that both terms and documents are represented as simple vectors that have an associative existence in the space. The projection of term 1 onto term 2 inherently compares the two documents represented in the space above. This follows because we express documents in terms of simple term vectors: the basic components of the documents being terms provides a plausible way of imaging one document onto the other.
Ranking results
Let us represent each document d as a vector of term weights, d^ = (w1, w2, …, wt).
In matrix notation, the collection can be written as a term-document matrix A whose entry aij is the frequency of term i in document j.
It is important to note that this matrix implicitly represents the occurrence frequency of the terms within each document without taking note of their relationships. It is a simple map of where the terms appear in the document; the relationship between them is not examined at all. But at the same time, just like any other map, it points us to an exact point of reference within the space. Hence it is conceivable that the picture painted has to be viewed from different angles to suit different needs.
Representing the scalar product of the document and query vectors in matrix terms gives d^ · q^ = dᵀ G q, where G is the measuring matrix of inner products between the basic term vectors. It follows that G would be a perfect identity as seen from the equation above: the area of the vector space that the query q matches against is an identity matrix G. This assumption is drawn from the representation of A, since the comparison has to take place along all the basic term vectors, which are taken to be orthonormal; hence G can be assumed to be an identity matrix whose only possible diagonal value is 1. Therefore, to obtain a list of relevant documents, the following ranking equation can be used:
S(dj, q) = dj^ · q^ = Σi aij qi
which is a simple measure of vector q in the direction and magnitude of the space plotted by vector d. Here S is the similarity function in terms of the scalar product of d and q; when S is normalised,
S(d, q) = (d^ · q^) / (|d^| |q^|)
The results can then be displayed in decreasing order of similarity.
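The whole ranking step can be sketched end to end: with a term-document matrix A (rows are terms, columns are documents) and orthogonal term vectors (G taken as the identity), each document's score is the cosine of its column against the query vector, and the results are sorted in decreasing order. The matrix below is invented for illustration.

```python
import math

def rank(A, q):
    """Rank documents (columns of A) by cosine similarity to query q."""
    def cosine(d):
        dot = sum(di * qi for di, qi in zip(d, q))
        norm = (math.sqrt(sum(x * x for x in d))
                * math.sqrt(sum(x * x for x in q)))
        return dot / norm if norm else 0.0
    columns = list(zip(*A))  # one column per document
    scores = [(j, cosine(col)) for j, col in enumerate(columns)]
    return sorted(scores, key=lambda s: s[1], reverse=True)

# Three documents over three terms; each column is one document.
A = [[2, 0, 1],
     [1, 1, 0],
     [0, 2, 1]]
q = [1, 1, 0]
ranking = rank(A, q)  # document 0 scores highest
```

The output is exactly the decreasing-similarity list described above, ready for display to the user.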
Term correlations
As mentioned above, the terms within documents have a certain correlation over a given number of documents. If, say, computer science appears p times in a given n-dimensional document set, and the term computer engineering appears in the same set of documents q times, there surely must be a relationship between these two sets of documents. However, the vector space model as described by Salton (1989) [1] makes no effort to take this phenomenon into account, even though it could prove important, especially if in hindsight we regard it as a way of "syntaxing" a document without actually having to get into the actual syntax as explicitly written down within the language rules.
It can also be noted that correlations within documents, if mapped out with a suitable concept or set of tools within the vector space model, can in turn represent semantics to a small extent. Consider the terms pigs and rocket science: the chances of these two terms appearing in the same document over a given dimension are limited. Hence the relationship between term occurrences must have some significance in a given set of documents. It should be noted, however, that with vectors, terms are simple coordinates along the term axes. The essence of painting the picture is to locate where each document lies within the space, and term frequency by itself cannot possibly recover the original configuration of the document.
To demonstrate why term occurrences are important to note, Wong, Ziarko and Wong (1985) [4] examine a set of documents indexed by two terms using set notation. [For the purposes of this report the author recommends revising some basic concepts of algebra, vectors, set notation and matrices.]
Consider a set of documents D such that D = {d1, d2, d3, …, dn}, and let the document set be indexed by two terms, t1 and t2. Hence D can be normalised to contain only t1 and t2. Using set notation to represent the set of documents, it is possible to show the different parts of the document collection. As seen from the diagram below, it can be concluded that there is a certain degree of correlation between terms occurring in the same document.
Diagram 3.
As shown in the diagram above, area a represents the documents indexed only by term t1, and likewise area c represents the documents indexed only by term t2. The intersection of the sets is the area where both terms appear in the same set of documents. As demonstrated above, there is a relationship between terms appearing together in the same document. This can be further normalised to imply that the relationship between terms is directly proportional to the number of documents in which the two terms appear together.
From the above example, take the cardinality c(D) of any given set D. Given that the set above has only two terms, the cardinality of interest is c(Dt1t2), the number of documents containing both terms. This function goes some way towards correlating the terms appearing in the same document; we could therefore conclusively say that the subset Dt1t2 = Dt1 ∩ Dt2. In the vector model we can correlate terms t1 and t2 by looking at the scalar product of the normalised vectors t1 and t2. It should be noted that term correlations can well be an important part of the retrieval system, but correlations alone cannot possibly represent the entire document space. In the following sections of the report, several configurations of and relationships between terms within the document space are discussed and their implications examined.
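The set-based correlation described above can be made concrete: the more documents two terms share, the stronger their association, and normalising the overlap by the union gives a value in [0, 1]. The report only says the relationship is proportional to the shared documents, so the Jaccard-style normalisation here is one reasonable choice, not the report's own formula.

```python
def term_correlation(docs_t1: set, docs_t2: set) -> float:
    """Overlap of the two terms' document sets, normalised by the union."""
    union = docs_t1 | docs_t2
    if not union:
        return 0.0
    return len(docs_t1 & docs_t2) / len(union)

D_t1 = {"d1", "d2", "d3"}   # documents indexed by t1
D_t2 = {"d2", "d3", "d4"}   # documents indexed by t2
c = term_correlation(D_t1, D_t2)  # 2 shared documents of 4 total -> 0.5
```

Terms that never co-occur score 0; terms indexed over exactly the same documents score 1.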
Possible conceptual solutions to improve the vector space and their implications:
It is generally agreed that the VSM can, up to a certain point, represent documents in whole, or at least to a close percentage, in terms of their index terms. However, it becomes imperative to examine the many possible configurations that the space model may actually take, given the numerous functions that can be carried out on vectors, some of which have been successful in areas outside information retrieval, such as engineering and other empirically based systems. It is essential to investigate whether the concepts are actually misrepresented within the VSM. After all, it seems fairly sensible that if one followed a map correctly one should be able to reach one's destination. Why then, if we can exhaustively represent the information system we desire to search, can we not model it in such a way that the documents fit within the space in a distribution where pointing in a given direction leads us to a specific item?
From a user's standpoint there must be a set of documents within the space that exactly matches his or her query, since a query is a vector within the defined space. So why does this not seem to be the case in information retrieval systems?
Salton, Wong and Yang (1975) [3] note that there are three possible scenarios for the configured space. Since we do not explicitly know
how the vector space model actually defines itself, we can make a few assumptions based on the vector model functions available.
From a user's perspective there should be a perfect representation of documents that at least matches the query. This could take the form of a space spread in such a way that a certain collection of documents within a certain area of the space actually matches the query. It is easy to see, however, that since the users of information retrieval systems come from all walks of life, the information stored covers a vast range of topics, and there is little or no advance knowledge of the possible user queries, it becomes almost impossible to represent this kind of space. As seen from diagram 1 below, the main question that then arises is that of term-for-term correlations.
Ideal vector space
In an ideal world figure two would be the perfect answer to indexing; however, it should be noted that in information retrieval the user's assessment of relevance with regard to the query is not known prior to constructing the space. This model would suit an engineering project where all the terms are fully normalised, but for the purposes of information retrieval, where terms are ambiguous and concepts sometimes mixed up, it is not a sufficient way to construct the space. It is also noted that if a case existed in which all terms could be implicitly expressed and fully normalised, then the above space configuration could well represent the perfect space. If a set of documents covered a very specific area, say medical records with a fixed vocabulary, then this configuration could be used; it could also form part of an area within a larger vector space. The author suggests that where terms ti within the space can be fully represented so as to exhibit characteristics of a specific type, it should follow that there is a global state of stable inter-relationships with a possible maximum set of such terms.
Let us consider a space characterised by documents that have maximum separation between them; that is, each document is distinct from the next, and the relative distance between them characterises their similarity. Consider diagram two below, where each x represents a document.
Diagram showing maximum separation
Is it the case that the set of terms chosen should be orthogonal to each other, whereby a term representing one document is at right angles to the next, so that the space shapes itself such that the further each term lies from the next, the more likely it is to be isolated in an island of its own? Both conceptually and computationally this scenario seems impractical: the function that computes the similarity between two documents would have to match every document against every other one in the collection.
Consider the function below, suggested by Salton, Wong and Yang (1975) [3], when a collection of n documents is examined:

F = sum over all document pairs (i, j), i not equal to j, of s(di, dj)

where s(di, dj) is the similarity between documents i and j. An optimal configuration is one that minimises F, so that the average similarity between documents is smallest. The significance of this spacing is the high precision it exhibits: a user query will align itself nearest to a specific set of documents and further away from the non-relevant ones. Recall is also high, since documents lying within the same area will be retrieved while the non-relevant ones are ignored. Nevertheless, a configuration of this nature is clearly not feasible. First, it is not clear how optimum separation between documents, to achieve optimum retrieval, can be
represented when the terms we consider are orthogonal. Secondly, the number of comparisons that must be done for n documents is on the order of n squared.
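The pairwise comparison described above can be sketched in Python. This is a minimal illustration, not part of the original report: the dictionary-based term-weight vectors and the tiny three-document collection are invented for the example, and the nested loops make the n-squared cost visible.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two term-weight vectors (dicts of term -> weight)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = sqrt(sum(w * w for w in a.values()))
    nb = sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def average_pairwise_similarity(docs):
    """Average s(i, j) over all distinct pairs -- O(n^2) comparisons."""
    n = len(docs)
    total = sum(cosine(docs[i], docs[j])
                for i in range(n) for j in range(i + 1, n))
    pairs = n * (n - 1) / 2
    return total / pairs if pairs else 0.0

# Hypothetical mini-collection for illustration.
docs = [{"heart": 1.0, "attack": 0.5},
        {"heart": 0.8, "surgery": 0.7},
        {"python": 1.0, "code": 0.6}]
print(average_pairwise_similarity(docs))
```

A well-separated collection drives this average towards zero, which is exactly the minimisation the text describes.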
Classification with centroids being the super class
The next space configuration organises documents hierarchically, whereby a certain point within each cluster is fully representative of the area in question. This leads to the formation of classes, where documents are grouped together and represented by a centroid, as shown in the diagram below.
The circles represent the different clusters of documents, and the large black Xs represent the centroids, which lie roughly in the middle of each cluster. From this configuration it is also possible to have a main centroid for the entire collection of, say, n documents. If we take a class or cluster P consisting of n documents, each element of its centroid C can be computed as the average weight of the corresponding elements in the document vectors. It follows that we can have a main centroid for a given collection, computed from the average weights across the various centroids within the vector space.
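The per-term averaging just described can be sketched as follows; the sample cluster is hypothetical, but the computation mirrors the text: each centroid element is the average weight of the corresponding elements in the document vectors.

```python
def centroid(vectors):
    """Centroid of a cluster: per-term average weight over its document vectors."""
    c = {}
    for v in vectors:
        for term, w in v.items():
            c[term] = c.get(term, 0.0) + w
    n = len(vectors)
    return {t: w / n for t, w in c.items()}

# Hypothetical two-document cluster.
cluster = [{"heart": 1.0, "attack": 0.5}, {"heart": 0.5}]
print(centroid(cluster))  # {'heart': 0.75, 'attack': 0.25}
```

The main centroid of a collection can then be obtained by applying the same function to the list of cluster centroids.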
From this configuration, we may then consider the sum of similarity coefficients with the main centroid as the measure of similarity between documents. This is computationally feasible, as we only have to match each document once against the corresponding centroid. This
in turn describes a centroid as a perfect return value for a query, should a user 'mistakenly' or luckily match the centroid's coordinates exactly. The measure can be written as

F = sum over i = 1..n of s(C*, di)

where C* is the main centroid and s(C*, di) is the similarity between it and document di.
A centroid represents a collection of similar items within a given document set. Most importantly, we could introduce a function to classify documents in such a way that there is a centroid in relation to every given pair of term occurrences within a document set. This function would serve as a similarity coefficient between terms as well as documents. This is probably as close as it gets to query matching in the VSM. The main assumption here is that clusters will generally hold documents with similar characteristics, so precision with respect to a user's query can be achieved. The average similarity between different centroids is minimal, which helps avoid retrieving non-relevant documents. This configuration would generate tightly coupled documents within clusters while keeping the different centroids only loosely related to one another.
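Matching a query once per centroid and then ranking only within the winning cluster can be sketched as below. The cosine measure, the centroid averaging and the two sample clusters are assumptions made for illustration; the point is that the query is compared against each centroid rather than against every document in the collection.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = sqrt(sum(w * w for w in a.values()))
    nb = sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Per-term average weight over the cluster's document vectors."""
    c = {}
    for v in vectors:
        for t, w in v.items():
            c[t] = c.get(t, 0.0) + w
    return {t: w / len(vectors) for t, w in c.items()}

def search(query, clusters):
    """Match the query once per centroid, then rank only the best cluster."""
    cents = [centroid(cl) for cl in clusters]
    best = max(range(len(cents)), key=lambda k: cosine(query, cents[k]))
    return sorted(clusters[best], key=lambda d: cosine(query, d), reverse=True)

# Hypothetical clusters: one medical, one computing.
clusters = [
    [{"heart": 1.0, "attack": 0.5}, {"heart": 0.7, "surgery": 0.4}],
    [{"python": 1.0}, {"code": 0.9, "python": 0.3}],
]
print(search({"heart": 1.0}, clusters))
```

Only one similarity computation per centroid is needed before ranking, which is the computational saving the text claims for this configuration.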
Terms
Using the vector space model to represent information is already a questionable strategy, since the actual semantics and the various term correlations that exist within documents cannot be fully captured using vectors alone; at least, no system has yet fully used the vector space model to model our information retrieval systems. This problem is compounded further still by the VSM relying entirely on the indexed terms as the basis and generating set of all vectors within the model. The smallest unit pictured in the space model is a single term. Finding a configuration that accurately places terms in the space is already a significant problem, and it does not help at all if the terms chosen for indexing do not accurately represent the documents in which they are contained.
It therefore follows that the term-related problems found in existing information retrieval systems have an even more fundamental implication in the vector space model: the terms used determine the vector space.
Term mismatch
Term mismatch at its worst is where a term represents a totally different concept. Authors often use terms to represent different concepts in documents than the ones users first perceive; the problem is both language-driven and context-dependent. Term mismatch is most severe when a user supplies only a few terms in a query. One approach to resolving this problem is to use longer, more elaborate queries. Query expansion is seen as a way of providing the system with enough information to match up terms within documents: the longer the query, the greater the chance of terms co-occurring within documents, and hence of a usable relationship that may help the precision and recall of the search. It should be said that term mismatch is closely related to ambiguity. An alternative way to resolve ambiguity is to reduce the terms to a very specific domain. There are quite a few approaches to dimensionality reduction, among them manual thesauri, stemming, clustering and latent semantic indexing. None of these fully resolves the problem of term mismatching, but each goes some way towards improving the overall efficiency of the system.
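Query expansion against a manual thesaurus, the first of the approaches listed above, might look like the following sketch. The THESAURUS table is a hand-invented stand-in for a real thesaurus; the mechanism of appending related terms to a short query is the point being illustrated.

```python
# Hypothetical hand-built thesaurus mapping a term to related terms.
THESAURUS = {"pc": ["computer", "desktop"], "car": ["automobile"]}

def expand_query(terms):
    """Append thesaurus neighbours so short queries co-occur with more documents."""
    expanded = list(terms)
    for t in terms:
        expanded.extend(THESAURUS.get(t.lower(), []))
    return expanded

print(expand_query(["PC", "repair"]))  # ['PC', 'repair', 'computer', 'desktop']
```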
The Perl programmers in [6] used a stemmer while implementing their vector space model search engine, and noted that it is best to apply the stemmer after the stop list has been enforced. For example, taking the words belong and belongings, after the stemmer has been applied we get the root belong, covering the endings belonging and belongings. This allows flexible use of the term, and its referencing is improved, as the word may be used differently on different occasions. It is worth noting that without a stemmer, plurals could be missed when a search is submitted using singular terms.
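The stop-list-then-stemmer ordering recommended in [6] can be illustrated with a crude suffix-stripping stemmer. This is a stand-in for a real stemmer such as Porter's, and the stop list and suffix table are invented for the example; only the order of operations reflects the recommendation.

```python
STOP_WORDS = {"the", "a", "and", "of", "to"}
SUFFIXES = ["ings", "ing", "s"]  # longest first; a crude stand-in for a real stemmer

def stem(word):
    """Strip the first matching suffix, leaving at least a 3-letter root."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[:-len(suf)]
    return word

def index_terms(text):
    """Apply the stop list first, then the stemmer, as [6] recommends."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return [stem(w) for w in words]

print(index_terms("the belongings belong to a belonging"))  # ['belong', 'belong', 'belong']
```

All three surface forms reduce to the same root, so a search on any one of them would reference the others.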
Clustering uses a different approach, but the principle is the same, in the sense that similar terms are grouped together in such a way that once one of them is mentioned, the others are automatically referenced. If a user searches for, say, Personal Computers (PCs), in a clustered environment one would expect the term PC to be grouped with items related to computing, such as printers, hard drives, floppy discs and so on.
Spelling mistakes
The diagram below represents an index alongside the domain of documents it represents; the domain is arranged as an inverted file. If the term a12 were a spelling mistake within the directory, then document d3 would cease to exist, which is especially the case when a Boolean search is carried out. In the vector space model the term would simply be non-existent, as the possible terms are listed first. Spelling mistakes can affect both the index and the domain of document representatives. It is therefore recommended to use a thesaurus or dictionary to proof-read the query before it is submitted: a dictionary should check the user's query, and the query should then be resubmitted. At this point the dictionary could also offer possible synonyms and expand the query to include these terms when doing the search. Ideally, the indexed terms would be spell-checked too.
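A dictionary check of the kind suggested here can be sketched with a Levenshtein edit-distance comparison. The DICTIONARY set and the distance threshold of 2 are illustrative assumptions, not part of the original report.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming, row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

DICTIONARY = {"heart", "attack", "surgery"}  # hypothetical index vocabulary

def correct(term):
    """Return the term, or its closest dictionary entry within distance 2."""
    if term in DICTIONARY:
        return term
    best = min(DICTIONARY, key=lambda w: edit_distance(term, w))
    return best if edit_distance(term, best) <= 2 else term

print(correct("haert"))  # 'heart'
```

The same check could run over the indexed terms themselves, catching mistakes on the index side as well as in the query.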
All the above-mentioned term-related problems are closely related to the ambiguity of natural language itself. It is therefore fundamental to use approaches that try to maintain the correlations natural language relies on for free communication.
Conclusion
One question that arises after analysing how the VSM operates is its competence within
the information retrieval environment. Is it worth using the vector space model at all, can
the vector space model capture the whole picture or is it a question of mapping out
coordinates that are not fully representative of the actual document. The questions may
seem ambiguous but that is exactly the point. Does the vector space model address term
ambiguity in its entity or does it attempt to resolve it by ad hoc means when using
stemmers, clusters and the other methods discussed before
The relationships that naturally exist within documents over a spanned space are what
give the VSM its strength. Modelling information in the VSM therefore provides a
background from which concepts can be developed and extended to semantically define
the documents in question.
On the other side of the coin, however, is the uncomfortable idea that the terms we use to represent the documents may be insufficiently expressive. It therefore seems wise to model the VSM in a way that provides detail to the terms in question.
It is the author's opinion that much work remains to be done to improve the basic terms and paint a richer picture within the space. It should be noted, however, that since the semantics of terms are not a priority, the VSM is in effect a Boolean matching mechanism, in the sense that actual terms are matched rather than concepts compared.
The vector space model has a future in information retrieval if the terms used to index the documents can be mapped out slightly differently, to incorporate relationships within a document before looking at relationships between the various documents over the whole spanned domain or space. One approach to consider is testing within a very closed environment where all the possible relationships between terms can be exhaustively measured from a user's point of view. Until search experiments can provide consistent values with concepts being modelled, the vector space model will only reproduce what goes in as input and produce limited outputs. The future is nevertheless positive, as we have seen that the VSM provides a framework in which an integrated document can be represented.
Current research:
The way forward is to look at detailed term correlations and their implications. Is it the case that terms distribute themselves in a manner where they depend on each other? Can a certain level of association be modelled to maximise retrieval within information systems?
Current research in neural networks and natural language processing will perhaps open
new doors to understanding the relationships between terms. Since the VSM has good
measures to represent these relationships, it is feasible to see an improved information
retrieval system with improved document representation.
References
[1] Salton, G. (1989) Automatic Text Processing: the transformation, analysis, and retrieval of information by computer. London: Addison-Wesley Publishing Company.
[2] Wong, S.K.M, Raghavan, V.V. (1984) Vector Space Model of Information Retrieval
– A Reevaluation. [Online] Available at:
http://portal.acm.org/citation.cfm?id=636816 Accessed 24th October 2004
[3] Salton, G., Wong, A., Yang, C.S. (1975) A Vector Space Model for Automatic
Indexing. [Online] Available at:
http://portal.acm.org/citation.cfm?id=361220
[4] Wong, S.K.M, Ziarko, W., Wong, P.C.N (1985) Generalized Vector Space Model in
Information Retrieval [Online] Available at:
http://portal.acm.org/citation.cfm?id=253506 Accessed 19th October 2004
[5] Feldman, S (1999) Natural Language Processing in Information Retrieval
[Online] Available at
http://www.onlineinc.com/onlinemag/OL1999/feldman5.html
Accessed 1st November 2004
[6] Ceglowski, M. (2003) Building a Vector Space Search Engine in Perl
[Online] Available at:
http://www.perl.com/pub/a/2003/02/19/engine.html Accessed 21st October 2004
[7] van Rijsbergen, C. J. (1975) Information Retrieval [Online] Available at:
http://www.dcs.gla.ac.uk/Keith/Preface.html Accessed 12th October 2004