Terms of Reference

Terms of Reference
Abstract
THE VECTOR SPACE MODEL - AN INFORMATION RETRIEVAL METHOD
How the search engines work
Search Technologies
Interactive Search/Feedback
Expressing documents in terms of index terms
Measuring query correlations
Ranking results
Ideal vector space
Classification with centroids being the super class
Term mismatch
Spelling mistakes
References

Robert Ntege The vector space Model Page 1 of 22

THE VECTOR SPACE MODEL - AN INFORMATION RETRIEVAL METHOD

Abstract: This report aims to describe exhaustively the vector space model (VSM) method as used in information retrieval systems. The author attempts to analyse critically the strengths of the vector space model while suggesting ways of improving its shortcomings.
The computational vector space is at best misused or, to put it another way, underused. The vector space model is fundamentally prone to underachieve in the information retrieval domain where ambiguities exist within the indexed documents. It should be noted that ambiguity is a natural phenomenon within natural language. The strengths and weaknesses lie in the fact that we strip the entire document collection into suitable index terms, and it is at this point that the problem takes shape. We assume that the terms are spread in the vector space in such a way that they are evenly distributed, maintaining the overall picture of the collected terms without implicitly saying what the terms really are. The terms used to describe a vector space must be carefully examined so that their dependency within the document can still be clearly documented. The author attempts to investigate the different algorithms used to study the vector space. It will therefore be observed that there is rather more emphasis on the terms used to form the basic vectors than on the actual mechanics of the vectors themselves. Perhaps the most important aspect of this report is that the nature and performance of the vector space model is still shaped by the user, and the vector space takes shape according to the different information stored in the collection. Studying the vector space and the shape it might take is therefore the key to improving its performance in response to a user's query.

Discussion

In information retrieval systems, where millions of documents represent information regarding almost every aspect of humankind, a technique for retrieving individual elements within a collection is no easy feat.
It is therefore thought best to model the information as small units that can be placed in a sphere in such a way that their relative positions are known, or can be computed, given the different relationships within that sphere. The VSM attempts to draw a map of documents using their "pictorial content" when it uses terms to represent documents. In fact this seems like a plausible way of modelling information. After all, we take pictures of areas and map them out in a very specific manner that allows us to get from place A to place B. Or consider a patient with breathing-related problems. If there are special hospitals for each condition, then the user needs a map of some kind to locate, amongst these many hospitals, the specific one for his or her problem. Unfortunately, as mentioned before, the information pages are vast and unstructured. The only commonality within documents that can get us started is the words, or terms, within the documents. The possible pictures we can paint involve stripping documents down to their atomic content, that is, their special terms. It therefore follows that if we know the makeup of documents, can we fully model documents in such a way that they are fully representative over a given space? Other questions arise:

1. Would it be best to construct a vector space in such a way that each term is as far away from the next as possible?
2. Is there an actual mechanism for modelling documents to achieve maximum retrieval values for a given user query? That is to say, how can we locate each item within a collection of elements, all with the same makeup, without actually knowing or understanding what each element stands for?
3. Is it the case that natural language processing involves relationships that we do not necessarily use concisely?
There is no explicit representation of the vectors, hence the semantics are completely thrown out of the window, leaving us grasping at how well term co-occurrence can represent semantics.

Introduction

Information storage systems are constructed in such a way that the information stored should be easily accessible. However, this is easier said than done, as it is generally accepted that once the documents are all mixed up, their retrieval can sometimes become as random as determining the brightest star. In the Internet environment, where millions of different information documents are stored, the task is even greater. A good structure or representation of the information is one piece of the puzzle. To a certain extent, constructing logical structures may be relatively achievable, while matching a user's information request to a given document within the perfect index is another problem altogether. It is therefore thought best to construct an index of terms while keeping the users in perspective. This is even more valuable as the vector space model does not explicitly represent the terms. The vector space model essentially attempts to represent documents and terms as vectors. This is where we lose some degree of certainty as to where a document may lie in relation to the next in the VSM. It is not far-fetched to assume that each index term is a mapping of a document onto the space; this statement is an attempt to place the vector space within a plane. It is therefore conceivable that we shall have a clustered space where documents that represent the same terms or concepts lie within close proximity. This idea naturally leads us to conceive of document centroids, where the density of documents around a point represents similar items. Another measure would be to distribute the space in such a way that terms are as far apart as possible. This assumption, however, lends itself to new contradictions.
For instance, imagine the information storage and retrieval system of a bank, where for the purposes of this example it is strictly money transactions being stored. If a user constructed a vector into the space, say asking about pork bellies, we can clearly see the system being 100 percent precise, as this vector will lie well away from the other pairings in the vector space. It is on this assumption that we proceed to examine exhaustively the vector space model as first proposed by Salton (1989) [1].

It is worth mentioning that the vector space model severely suffers from the "term mismatch" problem. As we shall later discuss, the VSM takes on terms that are a finite representation of all the possible basic terms within the space. However, as is always the case, one term can be directly interchangeable with another, and this is where the VSM lags a little and needs some mechanism to accommodate synonyms. It would be ideal to break down a query into simple parts and resolve any concept differences that may be present. Since term frequency is one of the bases for constructing a vector model, it is essential that terms are well represented.

The vector space model makes a few fundamental assumptions before attempting to map out the domain. Firstly, there is the assumption that some words are more important than others. Word frequency is thought to be an important aspect, as it is assumed that the more often a term appears in a document, the more significance it carries. Secondly, Luhn's analysis of content and function words is put into practice by applying a stop list to remove all common words. Essentially the stop list includes the pronouns, articles, connectors, etc. that are to be stripped from the document. The remaining terms in the document form the basis for the desired vector space to be modelled.

Brief introduction to information retrieval and its problems
In the past few years the Internet has grown to an unprecedented scale, and with it the information stored has increased to a point where its archival and retrieval are now a major issue of concern. Perhaps by far the biggest point of concern is the structure of the Internet, which leaves us with only one possible way of collecting information: manual information gathering is not an alternative, so automated indexing of information is the default. Automated indexing in present IR systems has a few problems, which present themselves in terms of strategic positioning, the nature of the information stored, and finally the kind of users expected to use the IR systems. A good information retrieval system should be able to take a user query and return as many relevant documents as possible while rejecting the non-relevant items. This leads to the two common terms used in IR as a measure of how good the system is: the system should have good precision and high recall.

How the search engines work

Putting private information retrieval systems aside and considering the Internet, there are at least three major parts to a search engine. First there is the spider, also known as the crawler, which searches the different servers for new or updated web pages. The information gathered is stored in an index. How the index is formulated varies from company to company, as do the methods for retrieval. Most search engines have an automated index; the crawler updates the index with the information it gathers. This is usually in the form of keywords, which are related to specific documents. The second part of the search engine is the index. As mentioned earlier, this contains keywords that are related to specific documents on the web. Usually a user's query is directed towards the index, where the comparison takes place. Typically the index is populated in a hierarchical fashion, distinguishing between categories and sub-categories.
With automated search engines the categories can either be specified in the meta tags or picked out of the title pages. However, where a page is manually submitted, as with Yahoo, the search engine editor determines the category. The third part of the search engine is probably the most important of all. It is the search software that does the actual sifting of pages to determine which pages are relevant to the user query. The search software predetermines how the index will look and how it will be populated. The software also does the ranking of results, which are displayed on a user's screen in a suitable format. Another area comes into play with the visualisation techniques used to format and present information in a manner that is appropriate and "attractive".

Search Technologies

The fundamental point of having an IR system is to provide a user with an interface with which they can interact to retrieve the information stored. It is therefore the basis of this report to look at the various ways of retrieving information from information storage systems. In this report the vector space model is extensively studied; however, it is worth noting that there are a few other search technologies available, among which are the following.

Boolean Search: A Boolean search is based on the use of keywords. Basically, a query is used as the equivalent of 1, so an index term is either equal to one or to zero. If the index term is true to the query then it is returned as a search result. Upon retrieval, other criteria are used to rank the results according to how relevant they are to the query. The usual logical connectors AND, OR and NOT are used to make combinations of query terms.
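The Boolean matching just described can be sketched as follows; the document contents and keyword sets are illustrative assumptions, not examples from the text.

```python
# A minimal sketch of Boolean retrieval, assuming documents are reduced to
# keyword sets. A document is returned only if the query's Boolean condition
# holds; here the query is the conjunction "football AND scores".

docs = {
    "d1": {"football", "league", "scores"},
    "d2": {"soccer", "league", "scores"},
}

def boolean_and(doc_terms, query_terms):
    """True (1) if every query term occurs in the document, else False (0)."""
    return query_terms <= doc_terms  # set inclusion

query = {"football", "scores"}
hits = [name for name, terms in docs.items() if boolean_and(terms, query)]
# Only d1 is returned; d2 uses "soccer" and is missed despite being relevant.
```

The last comment illustrates exactly the limitation discussed below: matching is letter for letter, so interchangeable terms are treated as unrelated.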
Since query terms are matched letter for letter, it should be noted that a spelling mistake would have to be rectified by the search engine interface; otherwise there would be no results to return unless an index term with the same spelling mistake existed in the index. It is for this reason that a thesaurus or dictionary can be used either to expand the query or to check it for errors and other possible meanings. This added feature adds some efficiency to the Boolean search technique, but generally this is a very limited way of searching for documents, as we are well aware that concepts can be expressed using many terms. Feldman (1999) [5] gives the classic example of the terms football and soccer as used by American dictionaries. The two words are interchangeable; however, if a query were passed to a Boolean search engine asking for football articles, it would unfortunately miss those articles that used the word soccer. Rijsbergen [7] rightly calls this method simple matching, as it is clear that the function is only looking for commonality between document and query: "In fact, simple matching may be viewed as using a primitive matching function. For each document D we calculate |D ∩ Q|, that is the size of the overlap between D and Q, each represented as a set of keywords." Here Q is the set of keywords in the query.

Interactive Search/Feedback

It is typical of users not to be sure or exact about what they are searching for. However, this can be relatively eased if there is interaction with the search engine. Basically, the user types an initial query from which the search engine first assesses the user's information needs. If, for example, the user realises that the search term he or she is using is a frequent one, then it might be wiser to narrow down the search or be more specific.
If from the initial results the user gets too many irrelevant hits that nevertheless contain the term used, then it would be prudent to consult a dictionary for a term that may represent the information need in a more specific way. Basically, the user has the ability to change the search criteria, trying and modifying until the query is refined to a point where precision and recall are maximised. The one drawback with this approach is that users rarely want this kind of system, where the burden lies with them and not with the search engine. The trial-and-error method would not be very successful in non-academic environments, where sometimes people are not even sure of what they are really looking for. However, it should be said that if an interactive system were made user-friendly, with good graphical presentation, then the user might find it less taxing. There is a slight waste of time, but this is repaid in "precision" if the user finally gets exactly what he or she wants out of the system. Other technologies exist, such as cluster-based retrieval, cluster representatives, serial search and matching functions, alongside the VSM that we shall shortly discuss.

Basic concepts of the Vector Space Model

The first step in building a vector space is determining which terms within the documents will best represent the content of each document in turn. When a set that can exhaustively represent each and every document within a collection has been achieved, the indexing can then be done by matching each document to its corresponding set of index terms. In the VSM, both documents and index terms are represented as basic vectors in a linear space. Therefore the space is determined by both the terms used to index and the documents represented. The occurrence of a term in a document represents part of the document along that term's vector. Therefore the total set of elements concerning the documents to be retrieved is modelled as a vector space.
The queries are also represented as basic vectors, which are in turn measured relative to the vector space. At this point it is essential to note a few properties of a vector space. These properties lend themselves to accommodating new concepts that we perceive of the vector space, since such concepts can be represented as simple vectors within it. It can be argued that it is this area of the vector space that probably needs closer investigation to determine the efficiency of using the VSM. Wong and Raghavan (1984) [2] propose some important properties of the vector space that in turn help expand a basic vector into a new one:

a) any two elements of the system can be added to obtain a new element of the system;
b) any element can be multiplied by a real number;
c) the vectors obey several basic algebraic rules.

Expressing documents in terms of index terms

The basic assumption is that all vectors in the space are of unit length and that there is no explicit representation of terms. It follows that vectors do not have to start at zero within the coordinate system of the space; it is the relative distance between vectors that is measured and preserved. The second implication is that the projections of each vector pair are considered against the space. It is then conceivable that each document is a point in the space, represented by the area where the document vector touches the space. Similar documents will hence lie close together within the space. As we shall see later on, the index terms are the generating set of the space we desire to model. Hence the space is a finite space with n vectors mapped out distributively. As an example, let us take a collection of three documents represented by three index terms, and consider the following diagram representing its vector space.
Diagram 1: let the documents be represented by document vectors, let the query be represented by a query vector, and let the index terms T1, T2, T3, ..., Tn be represented by term vectors.

If "a boy and a stray dog" is document A, and "a boy" is document B, the vector space generated by documents A and B has the index terms (a, boy, and, stray, dog), and the documents can be represented as below:

Term:        a    boy    and    stray    dog
Document A:  2    1      1      1        1
Document B:  1    1      0      0        0

The vector of A is (2, 1, 1, 1, 1) and the vector of B is (1, 1, 0, 0, 0). The plotting of vectors in a given space is confined to the maximum number of terms the space has. It is this boundary that creates the measure of how the terms spread across the documents. The relationship of document A with B, as an image or picture, is the simple fact that document B covers two of the five terms of document A.

Term weighting is useful from the perspective that the more a term appears in a certain document, the stronger the association between term and document. The inverse is also true: the infrequency of a term in other documents emphasises its importance in a document where it does appear. When the term weights are normalised, documents can be given comparable scope, so that bigger documents do not score higher than necessary. For weighting purposes the terms could be assigned binary weights (0 or 1), or weights that grow with frequency of occurrence. This aspect is generally good for establishing the importance (frequency measure) of terms within a document. In our example above, the basic vector can be realised as follows: since we have five terms, the vector is denoted by five 0 positions; whenever a term appears, its position is set to 1, with a weight corresponding to its frequency of occurrence.
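The worked example above can be sketched in a few lines of code: building term-frequency vectors for document A ("a boy and a stray dog") and document B ("a boy") over the five index terms of the space. The tokenisation by whitespace is an assumption for illustration.

```python
# Build term-frequency vectors over a fixed list of index terms.
terms = ["a", "boy", "and", "stray", "dog"]

def tf_vector(text, terms):
    """Count each index term's occurrences in the document text."""
    tokens = text.lower().split()
    return [tokens.count(t) for t in terms]

vec_a = tf_vector("a boy and a stray dog", terms)  # [2, 1, 1, 1, 1]
vec_b = tf_vector("a boy", terms)                  # [1, 1, 0, 0, 0]
```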
For example, the empty vector is 0 0 0 0 0, while a document containing only the first and fourth index terms would be 1 0 0 1 0.

Let each index term be represented by a term vector. From the model represented in Diagram 1, the basis of the space is the set of term vectors; hence we can conclusively say that the entire space is just a combination of the term vectors t1, t2, ..., tt. The documents are represented as linear combinations of several term vectors. We can note here that the vector space is a finite collection in which each document d is explicitly expressed by a t-dimensional vector

d = (w1, w2, ..., wt)

where wi is the weight of term i in the document.

Measuring query correlations

We can either use the inner product of the vectors or the inverse function of the angle between corresponding vector pairs. The user query will essentially take the same shape as a document vector:

q = (q1, q2, ..., qt)

The scalar product is used to measure the correlation between d and q:

sim(d, q) = d . q = w1q1 + w2q2 + ... + wtqt

The relationships between the generated coefficients can then be used to rank the retrieved information documents. In essence, we are taking the relationship between the query terms and the document terms as measured against each basic term vector. The scalar product is a relationship born out of the projection of vectors onto each other. This is an important aspect in that it allows n vector dimensions to take shape, since each vector can be expressed implicitly as a function of the vectors adjacent to it. Diagram 2 shows the projection of query vector q against document d. From the above notation it is reasonable to assume that each document can be expressed as a small area within the document space, expressed in terms of the terms it contains. The relationships between the different terms will therefore determine how the user's query can align itself in the space. The projection of one term against another measures direction in that specified term's projection. The magnitude is represented as the distance of separation between the two terms over a specified space.
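The correlation measure just described can be sketched as follows, using the toy vectors from the earlier example plus one extra hypothetical document. The cosine normalisation (dividing the scalar product by the vector lengths) is one common choice for the angle-based measure.

```python
import math

# Scalar-product correlation between document and query vectors, normalised
# by vector length (the cosine of the angle between them), and a ranking of
# a toy collection in decreasing order of similarity. Document d3 is an
# illustrative extra vector, not taken from the text.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(d, q):
    denom = math.sqrt(dot(d, d)) * math.sqrt(dot(q, q))
    return dot(d, q) / denom if denom else 0.0

docs = {
    "A": [2, 1, 1, 1, 1],   # "a boy and a stray dog"
    "B": [1, 1, 0, 0, 0],   # "a boy"
    "d3": [0, 0, 0, 1, 1],  # hypothetical: contains only "stray" and "dog"
}
query = [1, 1, 0, 0, 0]     # query "a boy"

ranked = sorted(docs, key=lambda name: cosine(docs[name], query), reverse=True)
# B matches the query exactly (cosine 1.0), A partially (0.75), d3 not at all.
```

Sorting by the normalised scalar product is exactly the ranking step described in the following section.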
In the case of the VSM for information retrieval this works fine as long as we assume that the terms are totally independent of each other but exhibit some relationships when considered as a set. This is what makes the vector space model expandable, and it is on this basis that we create a set of terms that is sufficient for representation in vector form.

Diagram 2: the vector space has a dimension of two terms.

The important thing to note in the above diagram is that both terms and documents are represented as simple vectors that have an associative existence in the space. The projection of term 1 onto term 2 inherently compares the two documents represented in the space above. This is all the more so since we express documents in terms of simple term vectors. The basic components of the documents being terms provides a plausible way of imaging one document onto the other.

Ranking results

Let us represent the collection of document vectors in matrix notation as a document-term matrix A, where each row is a document vector and entry aij is the weight of term j in document i. It is important to note that this matrix implicitly represents the occurrence frequency of the terms within each document without taking note of their relationships. It is a simple map of where the terms appear in the document, and the relationships between them are not examined at all. But at the same time it is important to note that, just like any other map, it points us exactly to a point of reference within the space. Hence it is conceivable to assume that the picture painted has to be viewed from different angles to suit different needs. Representing the scalar product of the vectors in matrix terms, it follows that G, the measuring matrix, would be a perfect identity, as the area within the vector space that the query q matches against is the identity matrix G. This assumption is drawn from the representation of A, since the comparison has to take place along all the basic term vectors.
Hence G can be assumed to be an identity matrix whose only possible value is 1. Therefore, to get a list of relevant documents, the following ranking equation can be used:

S(d, q) = d . q

which is a simple measure of vector q in the direction and magnitude of the space plotted by vector d, where S is the similarity function in terms of the scalar product of d and q. When S is normalised:

S(d, q) = (d . q) / (|d| |q|)

The results can then be displayed in decreasing order of similarity to the matrix painted above.

Term correlations

As mentioned above, the terms within documents have a certain correlation over a given number of documents. If, say, computer science appears p times in a given n-dimensional document set, and the term computer engineering appears in the same set of documents q times, there surely must be a relationship between these two sets of documents. However, with the vector space model as described by Salton (1989) [1], there is no effort to take into account this phenomenon, which could actually prove important, especially if with hindsight we assume that this is in fact a way of capturing the "syntax" of a document without actually having to get into the syntax as explicitly written down within the language rules. It can also be noted that correlations within documents, if mapped out with a suitable concept or set of tools within the vector space model, can in turn represent semantics to a small extent. Consider the terms pigs and rocket science: the chances of these two terms appearing in the same document over a given dimension are limited. Hence the relationship between term occurrences must have some significance in a given set of documents. However, it should be noted that with vectors, terms are simple coordinates along the term axes.
The essence of painting the picture is to locate within the space where each document lies, and taking term frequency by itself cannot possibly reconstruct the original configuration of the document. To demonstrate how term co-occurrences are important to note, Wong, Ziarko and Wong (1985) [4] examine a set of documents indexed by two terms using set notation. [For the purposes of this report the author recommends revising some basic concepts of algebra, vectors, set notation and matrices.]

Consider a set of documents D such that D = {d1, d2, d3, ..., dn}. Let the document set be indexed by two terms t1 and t2; hence D can be normalised to contain only t1 and t2. Using set notation to represent the set of documents, it is possible to show the different parts of the document collection. As seen from the diagram below, it can be concluded that there is a certain degree of correlation between terms occurring in the same document.

Diagram 3.

As shown in the diagram above, area a represents only the documents indexed by term t1, and likewise area c represents only the documents indexed by term t2. The intersection of the sets is the area where both terms appear in the same set of documents. As demonstrated above, there is a relationship between terms appearing together in the same document. This can be further normalised to imply that the strength of the relationship between two terms is directly proportional to the number of documents in which they appear together. From the above example, if we take the cardinality c(D) of any given set D, then given that the set above has only two terms, the relevant cardinality would be c(Dt1t2). This function goes some way towards correlating the terms appearing in the same document. We could therefore conclusively say that the subset Dt1t2 is the intersection of the two term sets. In the vector model we can correlate terms t1 and t2 by looking at the scalar product of the normalised vectors t1 and t2.
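The set-based correlation above can be sketched as follows. The document collection is an illustrative assumption, and the normalisation shown (dividing the intersection size by the geometric mean of the two set sizes) is one reasonable reading of "the scalar product of the normalised vectors", not the paper's exact formula.

```python
import math

# Correlate two terms by the documents in which they co-occur:
# |D_t1 ∩ D_t2| normalised by sqrt(|D_t1| * |D_t2|).
docs = {
    "d1": {"computer", "science"},
    "d2": {"computer", "engineering"},
    "d3": {"computer", "science", "engineering"},
    "d4": {"pigs"},
}

def correlation(t1, t2, docs):
    d1 = {name for name, terms in docs.items() if t1 in terms}
    d2 = {name for name, terms in docs.items() if t2 in terms}
    if not d1 or not d2:
        return 0.0
    return len(d1 & d2) / math.sqrt(len(d1) * len(d2))

c = correlation("science", "engineering", docs)  # co-occur only in d3
z = correlation("pigs", "science", docs)         # never co-occur: 0.0
```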
It should be noted that term correlations can well be an important part of the retrieval system, but correlations alone cannot possibly represent the entire document space. In the following sections of the report, several configurations of, and relationships between, terms within the document space are discussed and their implications examined.

Possible conceptual solutions to improve the vector space and their implications

It is generally agreed that the VSM can, up to a certain point, represent documents in whole, or at least to a close percentage, in terms of their indexed terms. However, it becomes imperative to examine the many possible configurations that the space model may actually take, given the numerous functions that can be carried out on vectors, some of which have been successful in areas outside of information retrieval, such as the engineering industry and other empirically based systems. It is essential to investigate whether the concepts are actually misrepresented within the VSM. After all, it would appear fairly sensible that if one followed a map correctly, one should be able to get to one's destination. Why then, if we can exhaustively represent the information system we desire to search, can we not model it in such a way that the documents actually fit within the space in such a distribution that pointing in a given direction leads us to a specific item? From a user's standpoint there must be a set of documents within the space that exactly matches his or her query, since a query is a vector within the defined space. So why does this not seem to be the case in information retrieval systems? Salton, Wong and Yang (1975) [3] note that there are three possible scenarios that could actually happen, or be happening, to the configured space.
Since we do not explicitly know how the vector space model actually defines itself, we can make a few assumptions based on the vector model functions available. From a user's perspective there should be a perfect representation of documents that should at least match the query. This could take the form of a space that is spread in such a way that a certain collection of documents within a certain area of the space actually matches the query. It is obvious that, since in information retrieval systems the users are from all walks of life, the information stored is of a vast and wide topical nature, and there is little or no knowledge of the possible user queries, it becomes almost impossible to represent this kind of space. As seen from diagram 1 below, the main question that then arises is that of term-for-term correlations.

Ideal vector space

In an ideal world, figure two would be the perfect answer to indexing; however, it should be noted that in information retrieval the user's assessment of relevance in regard to the query is not known prior to constructing the space. This model would suit an engineering project where all the terms are fully normalised, but for the purposes of information retrieval, where terms are ambiguous and concepts sometimes mixed up, it is not a sufficient way to construct the space. It is also noted that if a case exists in which all terms can be implicitly expressed and fully normalised, then the above configuration could well represent the perfect space. If a set of documents covered a very specific, fixed area, say medical records, then this configuration could be used. It should therefore be noted that the above configuration could form part of an area within the vector space.
The author suggests that in cases where, within the space, the terms ti can be fully represented to exhibit characteristics of a specific type, it should follow that there is a global state of stable inter-relationships with a possible maximum set of term vectors.

Let us consider a space characterised by documents that have a maximum separation between them. That is to say, each document is unique from the next, and the relative distance between them characterises their similarity. Consider the diagram below, where each x represents a document.

Diagram showing maximum separation.

Is it the case that the set of terms chosen should be such that they are orthogonal to each other, whereby a term representing one document is at right angles to the next, making the space shape itself in such a way that the further each term is from the next, the more likely it is to be isolated on an island of its own? Both conceptually and computationally this scenario seems to be impractical. The function that would compute the similarity between two documents would have to match every single document against every other one in the collection. Consider the function below, as suggested by Salton, Wong and Yang (1975) [3], for a collection of n documents:

F = (1/n^2) * sum over i, j of sim(di, dj)

where sim(di, dj) is the similarity between documents i and j. When the above quantity is minimised, the average similarity between documents is smallest. The significance of this spacing is the high precision it exhibits, since a user query will align itself nearest to a specific set of documents and further away from the non-relevant ones. Recall is also high, since documents lying within the same area will also be retrieved while the non-relevant ones are ignored. It is, however, clearly not feasible to have a configuration of this nature.
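The average-similarity measure above can be sketched as follows. The cosine similarity and the toy vectors are assumptions for illustration, and the sum is taken over all ordered pairs (including each document with itself), which is one reasonable reading of the average-pairwise-similarity function.

```python
import math

# Average pairwise similarity over a collection of n document vectors:
# (1/n^2) * sum over all pairs of cosine(d_i, d_j).
# A "maximally separated" space minimises this value.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def average_similarity(vectors):
    n = len(vectors)
    total = sum(cosine(vi, vj) for vi in vectors for vj in vectors)
    return total / (n * n)

# Orthogonal documents: only the n self-similarities contribute.
orthogonal = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
# A clustered collection scores higher, i.e. is less well separated.
clustered = [[1, 1, 0], [1, 1, 0], [1, 0, 1]]
```

Note that `average_similarity` performs n squared comparisons, which is exactly the computational objection raised in the text.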
It is not clear how optimum separation between documents, to achieve optimum retrieval, can be represented, since the terms we consider are orthogonal. Secondly, it is noted that the number of comparisons that need to be done for n documents is n squared.

Classification with centroids being the super class

The next space configuration looks at organising documents in a hierarchical manner, whereby there is a certain point in the cluster that is fully representative of the area in question. This leads to the formation of classes, where documents are grouped together and represented by a centroid, as shown in the diagram below. The circles represent the different clusters of documents and the big black Xs represent the centroids, which lie almost in the middle of each cluster. From the above configuration it is also possible to have a main centroid for the entire collection of, say, n documents. If we take a class or cluster of documents P consisting of n documents, it is possible to compute each item of the centroid C as the average weight of the same elements in the corresponding document vectors:

ck = (1/n) Σ_{i=1..n} dik

where dik is the weight of term k in document i. It would therefore follow that we can have a main centroid for a given collection, computed from the average weight of each of the various centroids within the vector space. From this configuration, we may then consider the sum of similarity coefficients with the main centroid as the perfect measure for similarity between documents:

Q = Σ_{i=1..n} s(C*, Di)

where C* is the main centroid. This is computationally feasible, as we only have to match each document once against the corresponding centroid. This in turn describes a centroid as a perfect return value for a query if a user were to 'mistakenly' or luckily match the centroid's coordinates, since a centroid is a collection of similar items within a given document set.
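A minimal sketch of the centroid machinery just described, assuming cosine similarity as the coefficient s and made-up term weights; a query is matched once against each centroid rather than against every document:

```python
from math import sqrt

def cosine(a, b):
    # s(x, y): cosine similarity between two term-weight vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def centroid(cluster):
    # Each element of C is the average weight of the corresponding
    # element across the document vectors in the cluster.
    n = len(cluster)
    return [sum(col) / n for col in zip(*cluster)]

def search(query, clusters):
    # Match the query once per centroid, then rank only the documents
    # of the winning cluster -- far fewer than n^2 comparisons.
    best = max(clusters, key=lambda docs: cosine(query, centroid(docs)))
    return sorted(best, key=lambda d: cosine(query, d), reverse=True)

# Two toy clusters over a two-term index (hypothetical weights)
clusters = [
    [[2, 0], [3, 1]],   # documents dominated by term 1
    [[0, 2], [1, 3]],   # documents dominated by term 2
]
print(centroid(clusters[0]))      # [2.5, 0.5]
print(search([0, 1], clusters))   # [[0, 2], [1, 3]]
```

The routing step is where the precision assumption lives: it only works if documents inside a cluster really are more similar to each other than to documents elsewhere.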
However, most importantly, we could introduce a function to classify documents in such a way that there is a centroid in relation to every given pair of term occurrences within a document set. This function would serve as a similarity coefficient between terms as well as documents. This is probably the closest it gets as far as query matching in the VSM is concerned. The main assumption here is that clusters will generally hold documents with similar characteristics, hence precision in response to a user's query can be achieved. The average similarity between different centroids is minimal, hence retrieval of non-relevant documents is avoided. This configuration would generate tightly coupled documents within clusters while loosely defining the different centroids.

Terms

Using the vector space model to represent information is already a questionable strategy, since the actual semantics and the different term correlations that exist within documents cannot be fully documented using vectors alone. At least, there has not yet been a system that can fully use the vector space model to model our information retrieval systems. This problem is compounded further still by the VSM relying fully on the indexed terms used as the basis and generating set of all vectors within the entire model. The smallest unit pictured in the space model is a term by itself. Even when terms are accurately represented, matching them out in the space is already a significant problem; it does not help at all if the terms chosen for indexing do not accurately represent the documents within which they are contained. It therefore follows that the following problems within the existing set-ups of various information retrieval systems have an even more fundamental implication in the vector space model: the terms used determine the vector space.
Term mismatch

Term mismatch at its worst is where the term used represents a totally different concept. It is often the case that authors use terms to represent different concepts in documents than users may first perceive. This problem is both language driven and context dependent. The term mismatch problem is more severe when a user uses only a few terms to represent a query. One approach to resolving this problem is to use longer, more elaborate queries. Query expansion is seen as a way of providing the system with enough information to try to match up terms within documents. The longer the query, the greater the chance of terms co-occurring within documents, and hence of a usable relationship that may help with the precision and recall of the search. It should be said that term mismatch is closely related to ambiguity. The other alternative for resolving ambiguity is to reduce the terms to a very specific domain. There are quite a few approaches to dimensionality reduction, among which are manual thesauri, stemming, clustering and latent semantic indexing. None of these fully resolves the problem of term mismatching, but they go some way towards improving the overall efficiency of the system. The Perl programmers [6] used a stemmer while implementing their vector space model search engine and noted that it is best to apply the stemmer after the stop list has been enforced. For example, if we took the words belong and belongings, after the stemmer has been applied we get the root belong with two other possible endings, belonging and belongings. This allows flexibility in the use of the term, and its referencing is improved, as the word could be used differently on different occasions. It is worth noting that without a stemmer, plurals could be missed when a search is submitted using singular terms.
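A toy suffix-stripping sketch in the spirit of the stemming step described above; this is not the Porter stemmer used in [6], just an illustrative stripper, and the stop list is a made-up example, applied before stemming as [6] recommends:

```python
STOP_WORDS = {"the", "a", "and", "of", "to"}   # illustrative stop list
SUFFIXES = ("ings", "ing", "s")                # try longest suffix first

def stem(word):
    # Strip the first matching suffix, keeping a root of at least
    # three characters so short words are left alone.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    # Stop list first, then the stemmer, as recommended in [6].
    words = text.lower().split()
    return [stem(w) for w in words if w not in STOP_WORDS]

print(index_terms("the belongings belong to a belonging"))
# ['belong', 'belong', 'belong']
```

After this step, belong, belonging and belongings all index to the same root, so a query in the singular will still match plural occurrences in documents.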
Clustering uses a different approach, but the principle is the same in the sense that similar terms are grouped together in such a way that once one of them is mentioned, the others are automatically referenced. If a user searches for, say, Personal Computers (PCs), in a clustered environment one would expect the term PC to be grouped with all items related to computing, say printers, hard drives, floppy discs and so on.

Spelling mistakes

The diagram below is a representation of an index alongside the domain of documents it represents. The domain is arranged as an inverted file system. If the term a12 were a spelling mistake within the directory, then document d3 would cease to exist. This is especially the case when a Boolean search is being carried out. In the vector space model, this term would simply be non-existent, as the possible terms are listed first. Spelling mistakes can affect both the index and the domain of document representatives. It is therefore recommended to use a thesaurus or dictionary to proof-read the query before it is submitted. A dictionary should be used to check the user's query and the query then resubmitted. At this point a dictionary could also offer other possible synonyms and possibly expand the query to include these terms when doing the search. It would be ideal if the indexed terms were spell-checked too. All the above-mentioned term-related problems are closely related to the ambiguity within natural language itself. It is therefore fundamental to use approaches that try to maintain the correlations natural language uses to allow free communication.

Conclusion

One question that arises after analysing how the VSM operates is its competence within the information retrieval environment.
Is it worth using the vector space model at all? Can the vector space model capture the whole picture, or is it a question of mapping out coordinates that are not fully representative of the actual document? The questions may seem ambiguous, but that is exactly the point. Does the vector space model address term ambiguity in its entirety, or does it attempt to resolve it by ad hoc means when using stemmers, clusters and the other methods discussed before? The relationships that naturally exist within documents over a spanned space are what give the VSM its strength. Modelling information in the VSM therefore provides a background from which concepts can be developed and extended to semantically define the documents in question. On the other side of the coin, however, is the uncomfortable idea that the terms we use to represent the documents may be insufficiently expressed. It is therefore thought wise to model the VSM in such a way as to provide detail to the terms in question. It is the author's opinion that there is much work to be done on improving the basic terms to paint a richer picture within the space. It should be noted, however, that since the semantics of terms are not a priority, the VSM is a Boolean matching mechanism from the standpoint that actual terms are being matched as opposed to concepts being compared. The vector space model has a future in information retrieval if the terms used to index the documents can be mapped out slightly differently to incorporate relationships within a document before looking at relationships between the various documents over the whole spanned domain or space. Testing within a very closed environment, where all the possible relationships between terms can be exhaustively measured from a user's point of view, is one approach to consider. Until search experiments can provide consistent values with concepts being modelled, the vector space model will simply reflect what goes in as input and produce limited outputs.
The future is positive, as our analysis has shown that the VSM provides a framework in which an integrated document can be represented.

Current research: The way forward lies in looking at detailed term correlations and their implications. Is it the case that terms distribute themselves in a manner in which they are dependent on each other? Can it be conceived that there is a certain level of association that can be modelled to maximise retrieval within information systems? Current research in neural networks and natural language processing will perhaps open new doors to understanding the relationships between terms. Since the VSM has good measures to represent these relationships, it is feasible to foresee an improved information retrieval system with improved document representation.

References

[1] Salton, G. (1989) Automatic Text Processing: the transformation, analysis, and retrieval of information by computer. London: Addison-Wesley Publishing Company.
[2] Wong, S.K.M., Raghavan, V.V. (1984) Vector Space Model of Information Retrieval – A Reevaluation. [Online] Available at: http://portal.acm.org/citation.cfm?id=636816 Accessed 24th October 2004.
[3] Salton, G., Wong, A., Yang, C.S. (1975) A Vector Space Model for Automatic Indexing. [Online] Available at: http://portal.acm.org/citation.cfm?id=361220
[4] Wong, S.K.M., Ziarko, W., Wong, P.C.N. (1985) Generalized Vector Space Model in Information Retrieval. [Online] Available at: http://portal.acm.org/citation.cfm?id=253506 Accessed 19th October 2004.
[5] Feldman, S. (1999) Natural Language Processing in Information Retrieval. [Online] Available at: http://www.onlineinc.com/onlinemag/OL1999/feldman5.html Accessed 1st November 2004.
[6] Ceglowski, M. (2003) Building a Vector Space Search Engine in Perl. [Online] Available at: http://www.perl.com/pub/a/2003/02/19/engine.html Accessed 21st October 2004.
[7] van Rijsbergen, C.J.
(1975) Information Retrieval. [Online] Available at: http://www.dcs.gla.ac.uk/Keith/Preface.html Accessed 12th October 2004.