IR Systems and Web Search
By Sri Harsha Tumuluri (UNI: st2653)

World Wide Web
The Web is becoming a universal repository of human knowledge and culture. It allows unprecedented sharing of ideas and information on a scale never seen before. Any user can create their own Web documents and make them point to any other Web documents without restriction. Home shopping and home banking have become very popular and have generated several hundred million dollars in revenue.

The Web has Problems!
Despite this success, the Web has introduced new problems of its own. Finding useful information on the Web is frequently a tedious and difficult task. There is no fundamental data model for the Web, which means that information definition and structure are often of low quality. These difficulties have attracted renewed interest in IR and its technology as a promising solution, putting Information Retrieval at the center of the stage.

What is an Information Retrieval System?
An Information Retrieval System is capable of the storage, retrieval, and maintenance of information. Information can be text, images, audio, video, and other multimedia objects. An information retrieval process begins when a user enters a query into the system. Several objects may match the query, perhaps with different degrees of relevance.

Indexing documents
Indexes are one of the most important components of IR systems and web search. Indexing a document gives us the ability to fetch the required documents quickly. It improves query efficiency many fold, but at the same time it takes a lot of time and memory to create indexes on a collection. When creating indexes for search engines, we have to crawl all the documents on the World Wide Web.

Creating indexes (Basic idea)
1. Collect all the documents to be indexed.
2. Tokenize the text in each of the documents. (Tokens can be considered as words in the document.)
3. Process the list of words to turn them into index terms.
4.
Index the documents that each term occurs in by creating an inverted index.

Processing of words: Tokenization
Tokenization is the process in which we break down the contents of documents into words, phrases, symbols, or other meaningful elements called tokens.

Processing of words: Stop words
There is a set of words in the English language which are considered to be very common. Examples of stop words are a, an, the, which, that, is, at, etc. These words have no specific meaning and do not convey anything important or unique about a sentence or document.

Sounds good? Hold on!
Can we really eliminate stop words? In some cases we cannot. Consider queries like:
1. To be or not to be
2. The Who
3. The Rock
So stop-word elimination should be done very carefully, keeping such cases in mind.

Processing of words: Stemming
Stemming is the process of reducing words to their stem, base, or root form, which is generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem.
Words like Cat, Cats, Catlike, Cat's, Cats' will stem to the word Cat.
Words like Institute, Institutionalization, Institutional will stem to the word Institut.
By stemming it becomes easier to group words or tokens which are derived from the same word. Many search engines treat words with the same stem as synonyms.
Problems with stemming: can you stem query terms like [Larry Page], [Steve Jobs], [Bill Gates]?

Processing of words: Thesaurus Expansion
Pick words from the query and then expand them using a good thesaurus.
Example: a person looking for the query [Cars] would also like to see results for [Automobiles].
Problem: if the query is short and not descriptive, the results can drift in an undesirable direction.
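The word-processing steps above (tokenization, stop-word removal, stemming) can be sketched in a few lines of Python. This is a minimal illustration, not production code: the stop-word list is a small assumed sample, and the stemmer is a crude suffix-stripper standing in for a real algorithm such as the Porter stemmer.

```python
import re

# A small sample stop-word list (illustrative; real systems use much larger lists).
STOP_WORDS = {"a", "an", "the", "which", "that", "is", "are", "at", "or", "not", "to", "be"}

def tokenize(text):
    """Break text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def crude_stem(token):
    """A crude suffix-stripping stemmer, for illustration only
    (real systems use e.g. the Porter stemmer)."""
    for suffix in ("ies", "es", "s", "ing"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = tokenize("The cats are catlike")
terms = [crude_stem(t) for t in remove_stop_words(tokens)]
print(terms)  # -> ['cat', 'catlike']
```

Note that this pipeline would happily strip the stop words out of "To be or not to be", which is exactly the failure case described above.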
Ex: [Mining] would be expanded to both [Data Mining] and [Gold Mining].

Inverted index
Documents are broken down into words, and each word is saved along with the IDs of the documents it occurs in. The list of document IDs for a word is called its postings list.
http://nlp.stanford.edu/IR-book/html/htmledition/processing-boolean-queries-1.html
So if we want to fetch results for the query [Brutus Calpurnia], we take the intersection of the postings lists of the words [Brutus] and [Calpurnia].

Different Models of IR systems
Boolean Model
Vector Space Model (ranked retrieval model)
Probabilistic Model (ranked retrieval model)

Boolean Model
The Boolean model is one of the earliest Information Retrieval models, and it is very simple to understand and use. The user specifies queries with keywords combined using operators like AND, OR, and NOT. The Boolean model is based on set theory:
AND (^): intersection of two sets.
OR (v): union of two sets.
NOT (~): negation of a set, i.e., all the documents which do not contain a particular word.

Boolean Model: Example
1. To get results which contain both the terms "Data" and "Mining": query = [data AND mining]
2. To get results which contain either of the terms "Data" and "Mining": query = [data OR mining]
3. Similarly, to find all the documents which contain the terms X and Y but not Z: query = [X AND Y NOT Z]

Boolean Model: Drawbacks
It only returns documents based on exact matching; that is, it only checks whether a document contains the query terms or not. It cannot rank the documents by relevance. When the query string has too many conditions, it becomes very difficult to understand, as the operators clutter the query. However, it is still used by many power users, such as librarians and researchers, as they feel it keeps them in control of the retrieval process.

Vector Space Model (Ranked retrieval)
In the vector space model, documents/webpages are represented by a vector of terms.
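The inverted-index construction and postings-list intersection described above can be sketched as follows. The three-document collection is a made-up toy example, and the intersection uses Python sets for brevity (real engines merge sorted postings lists instead).

```python
from collections import defaultdict

# A hypothetical toy collection: document ID -> text.
docs = {
    1: "I did enact Julius Caesar",
    2: "Brutus killed Caesar and Calpurnia mourned",
    3: "Brutus and Calpurnia",
}

# Build the inverted index: term -> sorted postings list of document IDs.
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():
        index[token].add(doc_id)
postings = {term: sorted(ids) for term, ids in index.items()}

def boolean_and(term1, term2):
    """Answer [term1 AND term2] by intersecting the two postings lists."""
    p1 = postings.get(term1, [])
    p2 = postings.get(term2, [])
    return sorted(set(p1) & set(p2))

print(boolean_and("brutus", "calpurnia"))  # -> [2, 3]
```

The same postings lists also answer OR (union) and NOT (complement against the set of all document IDs), which is what makes the Boolean model below a direct application of set theory.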
If words are chosen as terms, then every word in the vocabulary becomes an independent dimension in a very high-dimensional vector space. Even the query is considered a vector. Each dimension has a weight associated with it. The weight of a component in a document vector is zero if the corresponding word is not present in that document.

Vector Space Model (Ranked retrieval)
The weights of the terms are not inherent in the model, but we can use a weighting scheme such as tf-idf. To assign a numeric score to a document for a query, the model measures the similarity between the query vector and the document vector. Typically, the angle between the two vectors is used as a measure of divergence, and the cosine of the angle is used as the numeric similarity. The cosine is 1 for identical vectors and 0 for orthogonal vectors.

The score of a document is computed as follows:
Score(q, d) = V(q) · V(d) / (|V(q)| |V(d)|)

Demerits
It treats a document as a list of words and does not consider their ordering. Ex: "New York" is split into the two words "New" and "York", which are then processed separately. The vectors are usually very large, and most of their components are zero. The query vector then has to be compared against all the document vectors to find relevant results.

Conclusion
The field of information retrieval has come a long way in the last forty years, and has enabled easier and faster information discovery. Techniques developed in the field have been used in many other areas and have yielded many new technologies which people use on an everyday basis, e.g., web search engines, junk-email filters, and news-clipping services. Going forward, the field is attacking many critical problems that users face in today's information-ridden world. With the exponential growth in the amount of information available, information retrieval will play an increasingly important role in the future.
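The cosine score above can be sketched as follows. For brevity this uses raw term frequencies as weights rather than full tf-idf, and represents vectors as sparse dictionaries so the zero components never need to be stored; the query and documents are made-up examples.

```python
import math
from collections import Counter

def vectorize(text):
    """Represent text as a sparse term-frequency vector (term -> count)."""
    return Counter(text.lower().split())

def cosine(v1, v2):
    """Cosine of the angle between two sparse vectors:
    Score = dot(v1, v2) / (|v1| * |v2|)."""
    dot = sum(v1[t] * v2[t] for t in v1 if t in v2)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

query = vectorize("data mining")
doc_a = vectorize("data mining and data analysis")
doc_b = vectorize("gold mining history")

# The document sharing more query terms scores higher.
print(cosine(query, doc_a) > cosine(query, doc_b))  # -> True
```

Note how the sketch also exhibits the demerits listed below: "data mining" is treated as two independent dimensions, and scoring a query means computing this cosine against every document vector.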
References
[1] http://people.ischool.berkeley.edu/~hearst/irbook/1/node1.html
[2] http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html
[3] http://en.wikipedia.org/wiki/Stemming
[4] http://www.cs.columbia.edu/~gravano/cs6111/Lectures/lect03-2.pdf
[5] http://nlp.stanford.edu/IR-book/html/htmledition/normalization-equivalence-classing-of-terms-1.html
[6] http://comminfo.rutgers.edu/~aspoerri/InfoCrystal/Ch_2.html#2.3.1
[7] Modern Information Retrieval by Amit Singhal.