Module 8 Introduction

There is so much data on the Web that we cannot hope to find what we want without the help of a search engine, such as Google, Yahoo, or Bing. How do these engines provide us with the webpages that we want to see?

The goals of this first module examining search engines are:
to understand the structure of a Web index;
to know what changes are applied to words on a webpage when they are converted into index terms;
to learn how queries are converted into search terms;
to appreciate the need for large computer clusters to serve Web searches in practice.

More specifically, by the end of the module, you will be able to:
describe the contents of postings lists;
explain indexing concepts, including case folding, stop words, and stemming;
explain the vocabulary problems arising from word segmentation, synonymy, polysemy, and variant spellings;
describe the main components of a simple Web search engine and how a user’s search flows through the engine, ending with the results page being returned to the user.

The online materials are supplemented by the first part of Chapter 4 in Web Dragons. You are responsible for the material from both sources.

© 2009, University of Waterloo

8.0 Searching and Indexing

Today there are more than one trillion webpages[1], containing an incredible amount of information on every topic imaginable. Furthermore, the number of webpages is growing by billions of pages each day. As in the libraries of old, it is impossible to find what you are looking for by browsing around from page to page. Even if a computer could do the browsing for you and could load 1000 pages per second, checking one trillion pages one after the other would take a billion seconds, which is over 31 years. As you have experienced, however, Google and other search engines can return answers to your requests almost instantaneously. How do they do it?

Before answering that question, let’s look a bit closer at the types of searches, or queries, that Google supports.
(Similar facilities are provided by many other engines, but we’ll look at Google to make our discussion more concrete.)

Practice Exercises

1) Open a browser window and enter Google’s web address. Notice the search box for entering your query. Notice also the two search buttons labeled Google Search and I’m Feeling Lucky. At the top of the window, notice links labeled Web, Images, Videos, Maps, News, and so on.
2) In the query box, enter the word Waterloo and then click Google Search. How long did it take to respond? How many webpages does Google report as having that word? Notice that Google returns references to several pages that it believes are most likely to match what you are searching for. For each one, it provides the page title, a snippet from the webpage showing your search word, the URI for the page, and links labeled cached and similar. Click on a webpage title (or URI) to see the full webpage found by Google. Does this page match what you had in mind when you first typed the search word? Use your browser’s back button to return to the results page and select another webpage title to examine another match.
3) Use the back button on your browser to go back to the Google search page and enter Waterloo in the search box again. Press I’m Feeling Lucky. Compare the page that Google returns to the list of pages you saw previously.
4) Again, use the back button on your browser to go back to the Google search page (it should still have Waterloo in the search box; if not, enter that word again). Click on the menu item Images at the top of the page (and then click on Search Images, if needed). What does Google return now? Try clicking instead on Maps and see what happens.
5) Use the back button on your browser to go back again to the first Google search page. This time enter the words Waterloo in the heart of southwestern Ontario, and click Google Search. How many webpages are reported as part of the results? Can you characterize which webpages are matched?
6) Again from the Google search page, enter "Waterloo in the heart of southwestern Ontario" (including the quotation marks) and compare what Google returns. Try again, but replace southwestern with central, noting again which pages are matched.
7) Go back to the Google search page and click on Advanced Search (next to the query box). From this page you can specify that all words must appear or that a choice of words should appear, and you can restrict various properties of the page (such as language, website, or recency). At the top of the page you can also find a link to Advanced Search Tips explaining how to formulate more sophisticated queries.

We'll start by looking at how search engines support simple queries and then look at some more advanced features in the next module.

[1] http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html

8.1 Structure of a Text Index

How do you quickly find which pages of a textbook describe computing, or twelve-tone composition, or green newts? You might try the table of contents to find relevant sections and then browse from there. A better way is to refer to the index located at the back of the book, which lists some words or phrases from the book and the pages on which they appear. Similarly, search engines rely on an index to find which webpages mention which words.

In this section you'll learn how a text index is organized, and in the next you’ll learn how such an index is used to answer simple queries. Later modules will examine how the same index is used for more complex queries, and how it helps to present the list of matching webpages such that those most likely to be of interest are near the top of the list.

8.1.1 Inverted Files

A webpage can be viewed as a sequence of words mixed in with a sequence of tags. For the time being, we'll ignore the tags and concentrate on the words that make up a webpage's content.
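The view of a webpage as words mixed in with tags can be made concrete with a short Python sketch. It relies on the standard library's html.parser module; the class name WordExtractor, the function page_words, and the sample page are hypothetical names chosen for illustration, and splitting on white space is only the crudest possible way of picking out words.

```python
from html.parser import HTMLParser


class WordExtractor(HTMLParser):
    """Collects the text found between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.words = []

    def handle_data(self, data):
        # Crude assumption: words are whatever is separated by white space.
        self.words.extend(data.split())


def page_words(html):
    """Return the sequence of words on a page, with all tags dropped."""
    parser = WordExtractor()
    parser.feed(html)
    parser.close()
    return parser.words


# Hypothetical page used only for illustration.
sample = "<html><body><h1>Visit Waterloo</h1><p>A <b>great</b> place.</p></body></html>"
print(page_words(sample))
```

Note that the final word still carries its trailing period, so a real indexer needs a more careful definition of a word than this sketch uses.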
The following example includes four pages containing one sentence each. This structure can be “inverted” so that each word references the pages on which it is found, instead of each page referencing the words found on that page.

Waterloo → Pg 1 Pg 2
study → Pg 1 Pg 3
I → Pg 3 Pg 4
the → Pg 1 Pg 3 Pg 4
in → Pg 2 Pg 3 Pg 4
Bruce → Pg 3 Pg 4
Peninsula → Pg 4
…

Now, given a particular search word, we can immediately find the list of pages on which that word appears, which is known as its postings list. Since the number of different words used on the Web is very large, the inverted file needs to be organized to make it easy to find the postings list for any particular word. We will assume the simplest organization, which is to keep all the words in a sorted list, using alphabetical order, just like the index in the back of a book. There are also more sophisticated structures that could be used to keep the collection of words, but that is beyond the scope of this course.

Practice Exercises

1) For the four example pages above, show the postings lists for the words Canada, candlestick, is, to, and visit.
2) Examine your postings lists to see whether any pages use both words Canada and visit.

8.1.2 Postings Lists

We have been showing postings lists that are simply lists of pages, but they are usually more complicated. For example, each list is normally preceded by a count of how many postings are on the list. What else might we want to store in the list for each page? That is, what should we include in each individual entry on the list (known as a posting)?

Each posting must include an identifier that uniquely determines the page being referenced. For webpages, a convenient identifier is the URI for the page, and so for simplicity this is what we’ll assume is stored (although in practice it is too long to be repeated on many postings lists).

A page that mentions Waterloo repeatedly is more likely to be about Waterloo than a page that mentions it only once.
Therefore, a second value is stored in each posting showing the number of times that word appears on the page identified by the URI.

Webpages that use the word Waterloo near the beginning (for example, in the title or the first heading) are also more likely to be about Waterloo than pages that use the word later in the text. Furthermore, if we were looking for a phrase, such as University of Waterloo, then we would need to store exactly where on the page the word appears. We define the offset of a word occurrence on a webpage as the position of that occurrence on the page, counting words from 1 and proceeding from word to word in order of occurrence. A posting, then, also includes the offsets of all occurrences of the word on the webpage.

In summary, the postings list for the word the in the previous example would look like this:

the → 3: www.Pg1; 1: 1 www.Pg3; 2: 6, 9 www.Pg4; 2: 1, 5

where we assume that www.Pg1 is the URI for Pg 1, etc. The leading 3 is the number of postings on the list; within each posting, the number of occurrences on that page appears before the colon and the offsets of those occurrences appear after it. So that longer postings lists can be used for our examples, we will also occasionally write the lists vertically. This same example would then be illustrated as follows:

the → 3: www.Pg1; 1: 1
         www.Pg3; 2: 6, 9
         www.Pg4; 2: 1, 5

Practice Exercises

3) Using this same format, show the detailed postings lists for the words Canada, candlestick, is, to, and visit.
4) Using the postings lists only, check whether the words Canada and visit ever appear next to each other on any of the pages. Do they appear within eight words of each other on any page?

8.1.3 What is a Word?

We’ve been pretty casual about our attitude towards identifying words, assuming that for any webpage you would all agree on what the words on that page are. This is not quite as straightforward as it might seem at first glance.

A first attempt at defining precisely what makes up a word might be that a word is any sequence of characters surrounded by blanks.
However, the first word does not have a blank before it, and the last one might not have a blank after it. Furthermore, on Pg 1 above, we don’t really want to treat great, as a word that includes the trailing comma, nor do we want to include the period with the word visit. Some pages might include tab characters or other types of white space, and many will omit a blank between the last word on one line and the first word on the next; instead the carriage return, linefeed character, or both together separate these words.

We might then refine our definition of what makes up a word to be any sequence of characters that does not include punctuation marks or white space characters. Consider, however, south-western on Pg 2 above; should this be counted as one word or two? How about U.S. or AC/DC? In fact, what should we do with numbers such as 3/4, 3.14, or 6*10^23? As you can see, nothing is easy!

A text indexing system must work from a highly detailed definition of which sequences of characters to treat as words, and the process of separating a document into words is known as word segmentation. (Word segmentation is even more severe a problem in languages where large words are created by stringing together smaller ones, such as German, Hungarian, Korean, and Turkish, and in languages where most word boundaries are not indicated at all, such as Chinese, Japanese, and Thai.) Perhaps the simplest assumption is that only sequences of alphabetic characters (a..z and A..Z) are to be treated as “words” (but then you wouldn’t be able to find the Star Wars characters R2-D2 or C-3PO as easily). Instead, every time you are asked to index documents, you must be told explicitly which characters are to be included as parts of words, which should be ignored, and which should be treated as word breaks (i.e., as if they were merely white space).

Let’s now consider another assumption that we made above: that the words The and the were to be treated as being the same word.
This is a far simpler problem to address: before indexing a text, we will convert every upper-case letter into a lower-case one. The process is called case folding, and it is universally applied by search engines. (The few instances where this might cause problems, such as confusing a possible abbreviation for United States, namely US, with the common word us, can be handled as exceptions or simply ignored.) An extension of this technique (to replace variants of a letter by a chosen representative for that letter) is to remove diacritic marks from letters, thus replacing é by e, Å by a, and ç by c. (Although this technique works well for users most comfortable in English, it is not necessarily beneficial for other users, for whom diacritics are often important elements in their queries.)

A practical problem that remains is the presence of very common words. The word the, for example, appears repeatedly on almost every webpage. The postings list for this word will be extremely long, and it is unlikely to be useful in identifying which documents a user wishes to see in response to a query. Many indexers therefore work with a list of stop words, a list of the most common words that are to be omitted from the index. This list usually includes articles (a, an, the), prepositions (in, of, over), pronouns (he, her, its), connectives (and, because, but), and other function words (have, not, to). Even though no postings list is created for a stop word, each occurrence of the word is included when calculating offsets for other words.
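The indexing steps described so far (word segmentation, case folding, removal of diacritics, stop words, and postings with counts and offsets) can be pulled together in a minimal Python sketch. Everything in it is an assumption made for illustration: the two pages and their URIs are hypothetical rather than the pages of the earlier example, the stop word list is deliberately tiny, and a word is taken to be any maximal run of letters.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stop word list.
STOP_WORDS = {"a", "an", "in", "is", "of", "the", "to"}


def fold(word):
    """Case-fold a word and remove diacritic marks (é -> e, ç -> c)."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_index(pages):
    """Return {term: {uri: [offsets]}} for every term that is not a stop word."""
    index = {}
    for uri, text in pages.items():
        # Assumption: a word is a maximal run of letters; all else is a break.
        words = re.findall(r"[^\W\d_]+", text)
        for offset, word in enumerate(words, start=1):
            term = fold(word)
            if term in STOP_WORDS:
                continue  # no posting is made, but the offset was still used up
            index.setdefault(term, {}).setdefault(uri, []).append(offset)
    return index


def format_postings(term, postings):
    """Render one postings list in the notation used in section 8.1.2."""
    entries = [
        f"{uri}; {len(offsets)}: {', '.join(map(str, offsets))}"
        for uri, offsets in postings.items()
    ]
    return f"{term} → {len(postings)}: " + " ".join(entries)


# Hypothetical pages, not the four pages of the course example.
pages = {
    "www.PgA": "The University of Waterloo is a great place to study.",
    "www.PgB": "Waterloo is in south-western Ontario.",
}
index = build_index(pages)
print(format_postings("waterloo", index["waterloo"]))
```

On the first page, Waterloo receives offset 4 even though The and of are stop words, because stop words are skipped only when postings are created, not when offsets are counted. The sketch also splits south-western into two terms, which is just one of the possible segmentation choices discussed above.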
A list of 89 common words in English[1] may be used as the stop word list:

a about after against all also although among an and are as at be became because been between but by can come do during each early for form found from had has have he her his however in include including into is it its late later made many may me more most near no non not of on only or other over several she some such than that the their then there these they this through to under until use was we were when where which who with you

A more extensive stop word list was compiled by C. J. van Rijsbergen for his classic text on information retrieval, and even longer ones may be used. Stop word lists have also been compiled for other languages[2]. (Recently, search engines for the Web have chosen to treat stop words with far more sophistication, so that phrases such as “The Who” and “To be or not to be” can be retrieved efficiently.)

At the other extreme is the problem of exceptionally long character strings that may appear on some webpages. Again, no user will be searching for such terms, and, unfortunately, there are many such terms, each of which occurs extremely rarely, often only once. Many search engines, therefore, choose to omit such hapax legomena (which are often spelling errors or just pure nonsense) from their indexes in order not to clutter the indexes unnecessarily.

Finally, we need to address the problem of word variants. A user who looks for pages containing the word sell would almost certainly also expect to find pages with the word sells. Which of sell, sells, selling, sold, resell, resold, and unsold would be better to index as if they were the same word? Again, this problem is even more noticeable in languages other than English, where many more word forms exist to reflect rich adjective and noun declensions and verb conjugations in many more tenses than are used in English.
Because the problem of finding the morphological roots of words is quite difficult, some search engines have adopted the use of word stemming, which uses simple rules to map similar words to common bases, called stems. For example, a stemmer might strip off endings such as -s and -ed, as well as prefixes such as re- and un-.

The result of these considerations is that search engines rarely refer to query word and index word, but instead refer to query term and index term. We will adopt this same convention, and use term rather than word when discussing querying and indexing.

Practice Exercises

For this set of exercises, assume that each row in the following table indicates the unique identifier (pageid) for a page and the text contents found on that page.

pageid    text contents
command   (Mac) Works in combination with other keys to provide keyboard shortcuts or to perform special functions
control   Like the shift key, this key works by holding it down while pressing another key. Control is used to enter short-cut commands from the keyboard, such as ctrl-C to copy text on a PC and ctrl-V to paste it.
delete    Removes the selected text; if nothing is selected, deletes text to the left of the insertion point (on a Mac) or to the right of the insertion point (on a PC); some Mac keyboards also have delete-forward keys
eject     Used to open and close the CD drive
enter     (PC) Moves the cursor to the next line or used to select the default button in a dialog box.
option    Used with several characters to provide special symbols, such as ≈, Ω or √, and works with the command and control keys to provide additional functions
return    (Mac) Moves the cursor to the next line or used to select the default button in a dialog box
shift     Creates upper-case characters when pressed with a letter and special symbols when pressed with a number (such as #, $ or %)
tab       Causes the insertion point to move right horizontally to the next tab stop or the next area of data entry
volume    Allows the volume to be raised, lowered or muted

5) Assuming the simplest definition of a term (consecutive strings of upper-case and lower-case letters with no intervening digits, punctuation, or other special characters), how many terms are in the pages with pageid command, control, and option?
6) Assuming case folding but no stemming, show the postings lists for the terms key, keyboard, keys, mac, and used.
7) Assuming that the stop word list includes the most frequent English words shown in the table above and all words having three or fewer letters, on how many postings lists will you find postings for the page with pageid control?

[1] http://msdn.microsoft.com/en-us/library/bb164590.aspx
[2] http://www.ranks.nl/resources/stopwords.html

8.2 Matching a One-word Query

Now that you know what an index looks like, let’s review what happens when you type a one-word query into a search engine’s search box.

1. First the search engine has to convert the query to the same form as its index terms. To start, the search term is converted to lower case. If it is a stop word, the engine will have to report that it cannot find a match (or, because such terms are so common that they appear on almost any page, the engine could select any small set of pages and then merely scan them to verify that they do indeed include the query term).
If the term is not a stop word, then the engine can apply whatever stemming it used (if any) when creating the index.

2. Next the engine has to find the converted query term within its dictionary of index terms. If the dictionary is an alphabetically ordered list of index terms, then it looks up the term much as you would look up a word in a conventional dictionary. If no matching entry is found, the engine reports that no webpages match your query.

3. If a match to your converted query term is found, then the engine retrieves the associated postings list. Using the count of the number of postings, it can immediately report the number of webpages that match your query. It can also display the URIs for 10 of those pages (or however many you wish to see). For each URI it displays, it can also show a suitable snippet from the corresponding page. If you want to see additional matches, it can show the next 10, and then the next 10, until you’ve seen enough.

How satisfied are you likely to be with the results? In an upcoming module, we’ll examine how the engine chooses which pages to show first in response to your query. For now, we’re interested in examining which webpages can be found somewhere in the collection of results.

One problem we haven’t yet considered in this context results from synonymy, the existence of multiple words having the same (or almost the same) meaning. Look again at our original example with four webpages. If you were to look for pages matching the query term found, you would have no matches. However, Pg 2 uses the word located, which has essentially the same meaning, and that page might well be of interest to you when you are searching for found. Even worse, Pg 4 does not use any such verb, but clearly the inclusion of the phrase place in Canada implies that it describes the location of the Bruce Peninsula.
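The three lookup steps above, and the synonymy problem just described, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than how any real engine is built: the dictionary holds only a handful of terms loosely based on the four-page example, the postings lists contain URIs only, the stop word list is a hypothetical short one, and the stemmer is a crude rule that strips a single trailing -ing, -ed, or -s (so the index is assumed to store the stem locat for located).

```python
from bisect import bisect_left

# Illustrative dictionary of index terms in alphabetical order, with one
# postings list (here just URIs) per term. "locat" is the stem of "located".
TERMS = ["bruce", "locat", "peninsula", "study", "waterloo"]
POSTINGS = [
    ["www.Pg3", "www.Pg4"],
    ["www.Pg2"],
    ["www.Pg4"],
    ["www.Pg1", "www.Pg3"],
    ["www.Pg1", "www.Pg2"],
]
STOP_WORDS = {"a", "in", "is", "the", "to"}  # hypothetical short list


def stem(term):
    """Crude stemmer (an assumption): strip one trailing -ing, -ed, or -s."""
    for suffix in ("ing", "ed", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def search(query, page_size=10):
    """Return (number of matching pages, first page_size URIs)."""
    term = query.lower()                      # step 1: case folding
    if term in STOP_WORDS:
        return 0, []                          # stop words have no postings list
    term = stem(term)                         # same stemming as the indexer
    pos = bisect_left(TERMS, term)            # step 2: search the sorted dictionary
    if pos == len(TERMS) or TERMS[pos] != term:
        return 0, []                          # no matching entry
    postings = POSTINGS[pos]                  # step 3: fetch the postings list
    return len(postings), postings[:page_size]


print(search("Waterloo"))
print(search("located"))
print(search("found"))
```

The query found returns no matches even though located does, because the lookup compares character strings and knows nothing about meaning. Snippets and paging through further groups of 10 results are left out of the sketch.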
A huge problem for search engines is to find the pages that are relevant to your query when they do not include the term or terms you enter into the search box. In fact, if you do not find the webpages you are expecting when you enter a query, it may be beneficial to try alternative words that might appear on the pages you had hoped to see.

A related problem we noticed in the context of wiki articles was polysemy, the existence of multiple, often unrelated meanings for a word. In our example, the word study appears on Pg 1 and on Pg 3, but these two instances represent different meanings of the word (the first is a verb related to learn, whereas the second is a noun synonymous with den). Consider a search for jaguar: are you looking for an animal or a car? Similarly, when looking for bank, do you want webpages describing financial institutions, grounds at the edges of rivers, rows of similar objects, underwater ridges, tilting airplanes, bouncing billiard balls, or something else entirely? All these uses of bank are intermingled on its postings list and therefore indistinguishable when answering a one-word query. The solution, therefore, is to elaborate your query by specifying other search words that help to disambiguate among the various senses. For example, you might enter river bank, bank turn, or money bank into the search box to make it clearer which webpages will be of interest to you. We’ll examine how search engines process queries with more than one term in the next module.

Finally, there is the problem of variant spellings for a word. For example, when looking for webpages mentioning colour, you probably also want to find those mentioning color; similarly, either organize or organise should return the same pages when queried. A related problem arises from the inconsistent use of hyphenation: webpages containing southwestern, south western, and south-western are probably all of interest when any of these forms is sought.
Dealing with these many forms of words might be addressed in the same way as dealing with other word variants, such as plurals and past tenses, by augmenting the rules for stemming to include other spelling variations. However, this won’t address the problem of spelling errors, either in the query or in the text contents of webpages: it is much more difficult to match the erroneous spelling souhwestern.

With all these difficulties, it is quite surprising that search engines are as useful as they are. Luckily we are not confined to entering just a single query when looking for some particular webpages. In response to what is returned, we can modify our search to retrieve additional pages or to eliminate superfluous pages from being returned. One of the most powerful techniques for such query refinement is to remove results containing a term by placing a hyphen (-) in front of it. For example, the query apple -computer might help you locate pages related to apples rather than those related to the Apple Computer Corporation. We’ll examine how this works in the next module.

8.3 Supporting Millions of Users

The following figure shows the basic components of a search engine and its connection to users and webpages via the Internet:

A request originates from any of the users and travels in packets across the Internet to the search engine. The engine’s front end processor examines the query, converts the search terms into index terms, and passes the converted query to the lookup processor, which connects the index terms to the corresponding postings lists in the index and combines them to produce matches. The best matches are passed back to the front end processor, which formats a result page for the user.
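The flow just described, from query to front end processor to lookup processor and back, can be sketched as two small Python classes. The class and method names, the index contents, and the plain-text results page are all hypothetical. Two further assumptions are labeled in the code: postings lists are combined by keeping only pages that appear on every list, and the "best" matches are simply the first ten.

```python
class LookupProcessor:
    """Connects index terms to postings lists and combines them into matches."""

    def __init__(self, index):
        self.index = index  # {index term: [URIs]}

    def lookup(self, terms):
        lists = [set(self.index.get(term, [])) for term in terms]
        if not lists:
            return []
        # Assumption: a page matches only if it appears on every postings list.
        return sorted(set.intersection(*lists))


class FrontEnd:
    """Converts a query into index terms and formats the results page."""

    def __init__(self, lookup_processor):
        self.lookup_processor = lookup_processor

    def handle(self, query):
        terms = query.lower().split()          # search terms -> index terms
        matches = self.lookup_processor.lookup(terms)
        lines = [f"{len(matches)} page(s) match: {query}"]
        lines.extend(matches[:10])             # assumption: first ten are "best"
        return "\n".join(lines)


# Hypothetical index contents for illustration.
engine = FrontEnd(LookupProcessor({
    "waterloo": ["www.Pg1", "www.Pg2"],
    "study": ["www.Pg1", "www.Pg3"],
}))
print(engine.handle("Waterloo study"))
```

Keeping the front end and the lookup processor as separate objects mirrors the separation of roles in the description above: the front end deals only with queries and result pages, and the lookup processor deals only with index terms and postings lists.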
When the user receives the results page back from the search engine, he or she may use it to access some of the webpages via the URIs returned, formulate another query to pass to the search engine, or go on to do something completely different.

Even though computers are fast, a search engine can be quickly overwhelmed by hundreds of millions of queries per day from users all over the Internet. Therefore, practical search engines do not have just one front end processor and one lookup processor, but instead include hundreds or thousands of computers linked together to form one seamless service.

Your search request can be handled by any one of the front end processors. Each processor can forward the converted query to an available lookup processor, which consults its index. For example, the red lines indicate one possible path that your request might follow. Each processor remembers where the request originated, so that it can return its results back along the same path. As demand for service grows, more computers can be added as front end processors or as lookup processors, and more copies of the index can be created as needed, to provide fast responses to everyone’s queries. To handle larger amounts of data, each copy of the index can be split across many computers.

Module 8 Summary

The indexing of webpages allows a search engine to answer user queries quickly and effectively. The technology in current web search engines has evolved from traditional information retrieval systems, which were first developed around 40 years ago, and which in turn evolved from conventional library indexing systems.
Some Key Terms

case folding
front end processor
index
index term
inverted file
lookup processor
match
posting
postings list
query
search term
stemming
stop word
variant spelling
word offset
word segmentation

Skills

manually applying stop word lists against documents
manually creating postings lists from documents
manually tracing user queries through the steps applied by Web search engines to produce resulting matches