Module 8 - University of Waterloo

Module 8 Introduction
There is so much data on the Web that we cannot hope to find what we want without the help of a search
engine, such as Google, Yahoo, or Bing. How do these engines provide us with the webpages that we want to
see?
The goals of this first module examining search engines are:
to understand the structure of a Web index;
to know what changes are applied to words on a webpage when they are converted into index terms;
to learn how queries are converted into search terms;
to appreciate the need for large computer clusters to serve Web searches in practice.
More specifically, by the end of the module, you will be able to:
describe the contents of postings lists;
explain indexing concepts, including case folding, stop words, and stemming;
explain the vocabulary problems arising from word segmentation, synonymy, polysemy, and variant
spellings;
describe the main components of a simple Web search engine and how a user’s search flows through the
engine ending with the results page being returned to the user.
The online materials are supplemented by the first part of Chapter 4 in Web Dragons. You are responsible for
the material from both sources.
© 2009, University of Waterloo
8.0 Searching and Indexing
Today there are more than one trillion webpages[1], containing an incredible amount of information on every
topic imaginable. Furthermore, the number of webpages is growing by billions of pages each day. Like the
As in the libraries of old, it is impossible to find what you are looking for by browsing around from page to page. Even if
a computer could do the browsing for you and could load 1000 pages per second, it would take over 31 years to
check every page one after the other.
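The 31-year figure can be checked with a few lines of arithmetic. This is only a sketch using the round numbers quoted above (one trillion pages, 1000 pages per second); the constant names are our own.

```python
# Rough check of the browsing-time estimate, using the round numbers from the text.
PAGES = 10**12                              # "more than one trillion webpages"
PAGES_PER_SECOND = 1000                     # loading rate assumed in the text
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60    # average year, including leap days

seconds_needed = PAGES / PAGES_PER_SECOND
years_needed = seconds_needed / SECONDS_PER_YEAR

print(f"{years_needed:.1f} years")          # prints "31.7 years"
```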
As you have experienced, however, Google and other search engines can return answers to your requests
almost instantaneously. How do they do it?
Before answering that question, let’s look a bit closer at the types of searches, or queries, that Google
supports. (Similar facilities are provided by many other engines, but we’ll look at Google to make our
discussion more concrete.)
Practice Exercises
1) Open a browser window and enter Google’s web address. Notice the search box for entering your
query. Notice also the two search buttons labeled Google Search and I’m Feeling Lucky. At the top of
the window, notice links labeled Web, Images, Videos, Maps, News, and so on.
2) In the query box, enter the word Waterloo and then click Google Search. How long did it take to
respond? How many webpages does Google report as having that word? Notice that Google returns
references to several pages that it believes are most likely to match what you are searching for. For
each one, it provides the page title, a snippet from the webpage showing your search word, the URI for
the page, and links labeled cached and similar. Click on a webpage title (or URI) to see the full webpage
found by Google. Does this page match what you had in mind when you first typed the search word? Use
your browser’s back button to return to the results page and select another webpage title to examine
another match.
3) Use the back button on your browser to go back to the Google search page and enter Waterloo in
the search box again. Press I’m Feeling Lucky. Compare the page that Google returns to the list of pages
you saw previously.
4) Again, use the back button on your browser to go back to the Google search page (it should still have
Waterloo in the search box; if not, enter that word again). Click on the menu item Images at the top
of the page (and then click on Search Images, if needed). What does Google return now? Try clicking
instead on Maps and see what happens.
5) Use the back button on your browser to go back again to the first Google search page. This time enter
the words Waterloo in the heart of southwestern Ontario, and click Google Search. How
many webpages are reported as part of the results? Can you characterize which webpages are matched?
6) Again from the Google search page, enter "Waterloo in the heart of southwestern
Ontario" (including the quotation marks) and compare what Google returns. Try again, but replace
southwestern with central, noting again which pages are matched.
7) Go back to the Google search page and click on Advanced Search (next to the query box). From this
page you can specify that all words must appear or that a choice of words should appear, and you can
restrict various properties of the page (such as language, website, or recency). At the top of the page
you can also find a link to Advanced Search Tips explaining how to formulate more sophisticated
queries.
We'll start by looking at how search engines support simple queries and then look at some more advanced
features in the next module.
[1] http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html
8.1 Structure of a Text Index
How do you quickly find which pages of a textbook describe computing, or twelve-tone composition, or green
newts? You might try the table of contents to find relevant sections and then browse from there. A better way
is to refer to the index located at the back of the book, which lists some words or phrases from the book and
the pages on which they appear.
Similarly, search engines rely on an index to find which webpages mention which words. In this section you'll
learn how a text index is organized, and in the next you’ll learn how such an index is used to answer simple
queries. Later modules will examine how the same index is used for more complex queries, and how it helps to
present the list of matching webpages such that those most likely to be of interest are near the top of the list.
8.1.1 Inverted Files
A webpage can be viewed as a sequence of words mixed in with a sequence of tags. For the time being, we'll
ignore the tags and concentrate on the words that make up a webpage's content. The following example
includes four pages containing one sentence each.
This structure can be “inverted” so that each word references the pages on which it is found, instead of each
page referencing the words found on that page.
Waterloo  → Pg 1, Pg 2
study     → Pg 1, Pg 3
I         → Pg 3, Pg 4
the       → Pg 1, Pg 3, Pg 4
in        → Pg 2, Pg 3, Pg 4
Bruce     → Pg 3, Pg 4
Peninsula → Pg 4
…
Now, given a particular search word, we can immediately find the list of pages on which that word appears,
which is known as its postings list. Since the number of different words used on the Web is very large, the
inverted file needs to be organized to make it easy to find the postings list for any particular word. We will
assume the simplest organization, which is to keep all the words in a sorted list, using alphabetical order, just
like the index in the back of a book. There are also more sophisticated structures that could be used to keep
the collection of words, but that is beyond the scope of this course.
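To make the idea concrete, here is a minimal sketch of building an inverted file and looking up a postings list in a sorted word list. The three pages are hypothetical (they are not the four example pages), the function names are our own, and words are simply taken to be runs of letters in lower case.

```python
import bisect
import re

# Hypothetical pages, made up for this sketch.
pages = {
    "Pg A": "Waterloo is a city in Ontario",
    "Pg B": "The university is in Waterloo",
    "Pg C": "Ontario is a province",
}

def build_inverted_file(pages):
    """Map each word to the sorted list of pages on which it appears."""
    inverted = {}
    for page_id in sorted(pages):
        # set(...) ensures a page is listed once per word, however often the word occurs
        for word in set(re.findall(r"[a-z]+", pages[page_id].lower())):
            inverted.setdefault(word, []).append(page_id)
    return inverted

inverted = build_inverted_file(pages)
vocabulary = sorted(inverted)   # all words in alphabetical order, like a book index

def postings_list(word):
    """Find a word by binary search in the sorted vocabulary."""
    i = bisect.bisect_left(vocabulary, word)
    if i < len(vocabulary) and vocabulary[i] == word:
        return inverted[word]
    return []                   # the word does not appear on any page

print(postings_list("waterloo"))     # ['Pg A', 'Pg B']
print(postings_list("candlestick"))  # []
```

Binary search over a sorted list is just one way to realize the "sorted list" organization assumed here; a real engine would likely use a more sophisticated structure.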
Practice Exercises
1) For the four example pages above, show the postings lists for the words Canada, candlestick, is, to,
and visit.
2) Examine your postings lists to see whether any pages use both words Canada and visit.
8.1.2 Postings Lists
We have been showing postings lists that are simply lists of pages, but they are usually more complicated. For
example, each list is normally preceded by a count of how many postings are on the list. What else might we
want to store in the list for each page? That is, what should we include in each individual entry on the list
(known as a posting)?
Each posting must include an identifier that uniquely determines the page being referenced. For webpages, a
convenient identifier is the URI for the page, and so for simplicity this is what we’ll assume is stored (although
in practice it is too long to be repeated on many postings lists).
A page that mentions Waterloo repeatedly is more likely to be about Waterloo than a page that mentions it
only once. Therefore, a second value is stored in each posting showing the number of times that word appears
on the page identified by the URI.
Webpages that use the word Waterloo near the beginning (for example, in the title or the first heading), are
also more likely to be about Waterloo than pages that use the word later in the text. Furthermore, if we were
looking for a phrase, such as University of Waterloo, then we would need to store exactly where on the page
the word appears. We define the offset of a word occurrence on a webpage as the position of that occurrence
on the page, counting from 1 and proceeding from word to word in order of occurrence on the page. A posting, then,
also includes the offsets of all occurrences of the word on the webpage.
In summary, the postings list for the word the in the previous example would look like this:
the → 3: www.Pg1; 1: 1
www.Pg3; 2: 6, 9
www.Pg4; 2: 1, 5
Here we assume that www.Pg1 is the URI for Pg 1, etc.; the count appears before each colon, and the offsets
of the occurrences appear after each colon. So that longer postings lists can be used for our examples, we will
also occasionally write the lists vertically. This same example would then be illustrated as follows:
the → 3:
www.Pg1; 1: 1
www.Pg3; 2: 6, 9
www.Pg4; 2: 1, 5
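The following sketch shows one way such detailed postings could be built and printed in the vertical format. The two pages and their URIs are hypothetical, the function names are our own, and words are again taken to be runs of letters in lower case.

```python
import re

# Hypothetical pages keyed by URI, made up for this sketch.
pages = {
    "www.PgA": "the city of Waterloo",
    "www.PgB": "visit the university in the city",
}

def build_postings(pages):
    """Build term -> {URI: [offsets]}, with offsets counted from 1."""
    index = {}
    for uri in sorted(pages):
        words = re.findall(r"[a-z]+", pages[uri].lower())
        for offset, word in enumerate(words, start=1):
            index.setdefault(word, {}).setdefault(uri, []).append(offset)
    return index

def format_postings(term, index):
    """Render a postings list vertically: postings count, then URI; count: offsets."""
    postings = index.get(term, {})
    lines = [f"{term} → {len(postings)}:"]
    for uri, offsets in postings.items():
        lines.append(f"{uri}; {len(offsets)}: {', '.join(map(str, offsets))}")
    return "\n".join(lines)

def adjacent_pages(index, first, second):
    """URIs of pages on which `second` occurs immediately after `first`."""
    result = []
    for uri, offsets in index.get(first, {}).items():
        following = set(index.get(second, {}).get(uri, []))
        if any(offset + 1 in following for offset in offsets):
            result.append(uri)
    return result

index = build_postings(pages)
print(format_postings("the", index))
# the → 2:
# www.PgA; 1: 1
# www.PgB; 2: 2, 5
```

Because the offsets are stored, a phrase such as "the university" can be checked from the postings alone: `adjacent_pages(index, "the", "university")` returns only `www.PgB`.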
Practice Exercises
3) Using this same format, show the detailed postings lists for the words Canada, candlestick, is, to, and
visit.
4) Using the postings lists only, check whether the words Canada and visit ever appear next to each
other on any of the pages. Do they appear within eight words of each other on any page?
8.1.3 What is a Word?
We’ve been pretty casual about identifying words, assuming that everyone would agree on what the words
on any given webpage are. This is not quite as straightforward as it might seem at
first glance.
A first attempt at defining precisely what makes up a word might be that a word is any sequence of characters
surrounded by blanks. However, the first word does not have a blank before it, and the last one might not have a
blank after it. Furthermore, on Pg 1 above, we don’t really want to treat great, as a word that includes the
trailing comma, nor do we want to include the period with the word visit. Some pages might include tab
characters or other types of white space, and many will omit a blank between the last word on one line and
the first word on the next; instead the carriage return, linefeed character, or both together separate these
words.
We might then refine our definition of what makes up a word to be any sequence of characters that does not
include punctuation marks or white space characters. Consider, however, south-western on Pg 2 above; should
this be counted as one word or two? How about U.S. or AC/DC? In fact, what should we do with numbers such
as 3/4, 3.14, or 6*10^23? As you can see, nothing is easy! A text indexing system must work from a highly
detailed definition of which sequences of characters are to be considered words, and the process of separating a
document into words is known as word segmentation. (Word segmentation is an even more severe problem in
languages where large words are created by stringing together smaller ones, such as German, Hungarian,
Korean, and Turkish, and in languages where most word boundaries are not indicated at all, such as
Chinese, Japanese, and Thai.) Perhaps the simplest assumption is that only sequences of alphabetic characters
(a..z and A..Z) are to be treated as “words” (but then you wouldn’t be able to find the Star Wars characters
R2-D2 or C-3PO as easily). Instead, every time you are asked to index documents, you must be told explicitly
which characters are to be included as parts of words, which should be ignored, and which should be treated as
word breaks (i.e., as if they were merely white space).
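The effect of different segmentation rules is easy to see in a small sketch. The first function applies the simplest rule (letters only); the second is a hypothetical configurable rule of our own, in which some characters are ignored and everything outside the word characters is a word break.

```python
import re

def segment_alphabetic(text):
    """Simplest rule: only runs of a..z and A..Z are words; all else is a word break."""
    return re.findall(r"[A-Za-z]+", text)

def segment_custom(text, word_chars="A-Za-z0-9", ignored="-."):
    """Hypothetical rule: delete the `ignored` characters, then treat every
    character outside `word_chars` as a word break."""
    for ch in ignored:
        text = text.replace(ch, "")
    return re.findall(f"[{word_chars}]+", text)

sample = "R2-D2 and C-3PO visit the U.S."
print(segment_alphabetic(sample))
# ['R', 'D', 'and', 'C', 'PO', 'visit', 'the', 'U', 'S']
print(segment_custom(sample))
# ['R2D2', 'and', 'C3PO', 'visit', 'the', 'US']
```

Neither rule is right in general: ignoring periods keeps U.S. together, but it would also turn 3.14 into 314.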
Let’s now consider another assumption that we made above: that the words The and the were to be treated as
being the same word. This is a far simpler problem to address: before indexing a text, we will convert every
upper-case letter into a lower-case one. The process is called case folding, and it is universally applied by
search engines. (The few instances where this might cause problems, such as confusing a possible abbreviation
for United States, namely US, with the common word us, can be handled as exceptions or simply ignored.) An
extension of this technique (to replace variants of a letter by a chosen representative for that letter) is to
remove diacritic marks from letters, thus replacing é by e, Å by a, and ç by c. (Although this technique works
well for users most comfortable in English, it is not necessarily beneficial for other users, for whom diacritics
are often important elements in their queries.)
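Case folding and diacritic removal can be sketched with Python's standard library. This is one possible approach, not necessarily what any particular engine does: the term is lower-cased, decomposed into base letters plus combining marks (Unicode NFD normalization), and the combining marks are dropped. The function names are our own.

```python
import unicodedata

def fold_case(term):
    """Convert every upper-case letter to lower case."""
    return term.lower()

def strip_diacritics(term):
    """Decompose accented letters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize(term):
    return strip_diacritics(fold_case(term))

print(normalize("The"))      # the
print(normalize("Élan"))     # elan
print(normalize("Å"))        # a
print(normalize("Façade"))   # facade
```

Note that this sketch also folds US to us, which is exactly the kind of collision mentioned above.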
A practical problem that remains is the presence of very common words. The word the, for example, appears
repeatedly on almost every webpage. The postings list for this word will be extremely long, and it is unlikely to
be useful in identifying which documents a user wishes to see in response to a query. Many indexers therefore
work with a list of stop words, a list of the most common words that are to be omitted from the index. This list
usually includes articles (a, an, the), prepositions (in, of, over), pronouns (he, her, its), connectives (and,
because, but), and other function words (have, not, to). Even though no postings list is created for a stop
word, each occurrence of the word is included when calculating offsets for other words. A list of 89 common
words in English[1] may be used as the stop word list:
a
about
after
against
all
also
although
among
an
and
are
as
at
be
became
because
been
between
but
by
can
come
do
during
each
early
for
form
found
from
had
has
have
he
her
his
however
in
include
including
into
is
it
its
late
later
made
many
may
me
more
most
near
no
non
not
of
on
only
or
other
over
several
she
some
such
than
that
the
their
then
there
these
they
this
through
to
under
until
use
was
we
were
when
where
which
who
with
you
A more extensive stop word list was compiled by C. J. van Rijsbergen for his classic text on information
retrieval, and even longer ones may be used. Stop word lists have also been compiled for other languages[2].
(Recently, search engines for the Web have chosen to treat stop words with far more sophistication, so that
phrases such as “The Who” and “To be or not to be” can be retrieved efficiently.)
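The interaction between stop words and offsets can be illustrated with a short sketch. The stop word set below is only a small subset of the 89-word list, the sample sentence is made up, and the function name is our own. The key point is that a stop word gets no posting but still advances the offset counter.

```python
import re

# Small illustrative subset of the 89-word stop word list.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def index_page(text, stop_words=STOP_WORDS):
    """Return term -> [offsets] for one page, skipping stop words."""
    postings = {}
    words = re.findall(r"[a-z]+", text.lower())
    for offset, word in enumerate(words, start=1):
        if word in stop_words:
            continue   # no posting is created, but the offset has been used up
        postings.setdefault(word, []).append(offset)
    return postings

print(index_page("The University of Waterloo is in Ontario"))
# {'university': [2], 'waterloo': [4], 'ontario': [7]}
```

The offsets 2, 4, and 7 are the same ones the words would have had if no stop words had been removed.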
At the other extreme is the problem of exceptionally long character strings that may appear on some
webpages. Again, no user will be searching for such terms, and, unfortunately, there are many such terms,
each of which occurs extremely rarely, often only once. Many search engines, therefore, choose to omit such
hapax legomena (which are often spelling errors or just pure nonsense) from their indexes in order not to
clutter the indexes unnecessarily.
Finally, we need to address the problem of word variants. A user who looks for pages containing the word sell
would almost certainly also expect to find pages with the word sells. Which of sell, sells, selling, sold, resell,
resold, and unsold would be better to index as if they were the same word? Again, this problem is even more
noticeable in languages other than English, where many more word forms exist to reflect rich adjective and
noun declensions and verb conjugations in many more tenses than are used in English. Because the problem of
finding the morphological roots of words is quite difficult, some search engines have adopted the use of word
stemming, which uses simple rules to map similar words to common bases, called stems. For example, a
stemmer might strip off trailing endings such as -s and -ed, as well as stripping off leading prefixes such as re-
and un-.
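A deliberately crude stemmer along these lines might look as follows. The rule lists are assumptions made for this sketch (the ending -ing and the minimum stem length are our additions), and real stemmers use far more careful rules.

```python
PREFIXES = ("re", "un")          # leading prefixes to strip
SUFFIXES = ("ing", "ed", "s")    # trailing endings to strip (-ing is our addition)

def stem(term, min_stem=3):
    """Strip at most one prefix and one suffix, keeping at least `min_stem` letters."""
    for prefix in PREFIXES:
        if term.startswith(prefix) and len(term) - len(prefix) >= min_stem:
            term = term[len(prefix):]
            break
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) - len(suffix) >= min_stem:
            term = term[:-len(suffix)]
            break
    return term

for word in ["sell", "sells", "selling", "resell", "sold", "resold", "unsold", "under"]:
    print(word, "->", stem(word))
# sell -> sell, sells -> sell, selling -> sell, resell -> sell,
# sold -> sold, resold -> sold, unsold -> sold, under -> der
```

The output shows both limits of simple rules: sell and sold end up with different stems because the rules know nothing about irregular verbs, and under is wrongly reduced to der.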
The result of these considerations is that search engines rarely refer to query word and index word, but instead
refer to query term and index term. We will adopt this same convention, and use term rather than word when
discussing querying and indexing.
Practice Exercises
For this set of exercises, assume that each row in the following table indicates the unique identifier (pageid)
for a page and the text contents found on that page.
pageid    text contents
command   (Mac) Works in combination with other keys to provide keyboard shortcuts
          or to perform special functions
control   Like the shift key, this key works by holding it down while pressing
          another key. Control is used to enter short-cut commands from the
          keyboard, such as ctrl-C to copy text on a PC and ctrl-V to paste it.
delete    Removes the selected text; if nothing is selected, deletes text to the
          left of the insertion point (on a Mac) or to the right of the insertion
          point (on a PC); some Mac keyboards also have delete-forward keys
eject     Used to open and close the CD drive
enter     (PC) Moves the cursor to the next line or used to select the default
          button in a dialog box.
option    Used with several characters to provide special symbols, such as ≈, Ω
          or √, and works with the command and control keys to provide additional
          functions
return    (Mac) Moves the cursor to the next line or used to select the default
          button in a dialog box
shift     Creates upper-case characters when pressed with a letter and special
          symbols when pressed with a number (such as #, $ or %)
tab       Causes the insertion point to move right horizontally to the next tab
          stop or the next area of data entry
volume    Allows the volume to be raised, lowered or muted
5) Assuming the simplest definition of a term (consecutive strings of upper-case and lower-case
letters with no intervening digits, punctuation, or other special characters), how many terms are
in the pages with pageid command, control, and option?
6) Assuming case folding but no stemming, show the postings lists for the terms key, keyboard,
keys, mac, and used.
7) Assuming that the stop word list includes the most frequent English words shown in the table
above and all words having three or fewer letters, on how many postings lists will you find
postings for the page with pageid control?
[1] http://msdn.microsoft.com/en-us/library/bb164590.aspx
[2] http://www.ranks.nl/resources/stopwords.html
8.2 Matching a One-word Query
Now that you know what an index looks like, let’s review what happens when you type a one-word query into a
search engine’s search box.
1. First the search engine has to convert the query to the same form as its index terms. To start, the
search term is converted to lower case. If it is a stop word, the engine will have to report that it cannot
find a match (or, because such terms are so common that they appear on almost any page, the engine
could select any small set of pages and then merely scan them to verify that they do indeed include the
query term). If the term is not a stop word, then the engine can apply whatever stemming it used (if
any) when creating the index.
2. Next the engine has to find the converted query term within its dictionary of index terms. If the
dictionary is an alphabetically ordered list of index terms, then it looks up the term much as you would
look up a word in a conventional dictionary. If no matching entry is found, the engine reports that no
webpages match your query.
3. If a match to your converted query term is found, then the engine retrieves the associated postings list.
Using the count of the number of postings, it can immediately report the number of webpages that
match your query. It can also display the URIs for 10 of those pages (or however many you wish to see).
For each URI it displays, it can also show a suitable snippet from the corresponding page. If you want to
see additional matches, it can show the next 10, and then the next 10, until you’ve seen enough.
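The three steps above can be sketched in a few lines. Everything here is hypothetical: the tiny index, the URIs, the stop word subset, and the function name are made up for illustration, no stemming is applied, and snippets are omitted.

```python
import bisect

STOP_WORDS = {"the", "of", "in", "is"}    # illustrative subset only

# Hypothetical index: term -> postings as (URI, count) pairs.
INDEX = {
    "ontario": [("www.PgB", 1)],
    "waterloo": [("www.PgA", 2), ("www.PgB", 1), ("www.PgC", 1)],
}
DICTIONARY = sorted(INDEX)                # alphabetically ordered index terms

def one_word_query(query, page_size=10, page=0):
    # Step 1: convert the query to the same form as the index terms.
    term = query.strip().lower()
    if term in STOP_WORDS:
        return {"matches": 0, "uris": [], "note": "stop word"}
    # Step 2: look the term up in the sorted dictionary.
    i = bisect.bisect_left(DICTIONARY, term)
    if i == len(DICTIONARY) or DICTIONARY[i] != term:
        return {"matches": 0, "uris": [], "note": "no match"}
    # Step 3: retrieve the postings list, report its size, and show one page of URIs.
    postings = INDEX[term]
    start = page * page_size
    uris = [uri for uri, _count in postings[start:start + page_size]]
    return {"matches": len(postings), "uris": uris, "note": "ok"}

print(one_word_query("Waterloo"))
# {'matches': 3, 'uris': ['www.PgA', 'www.PgB', 'www.PgC'], 'note': 'ok'}
```

Asking for `page=1` with a smaller `page_size` returns the next batch of URIs, mirroring "the next 10, and then the next 10".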
How satisfied are you likely to be with the results?
In an upcoming module, we’ll examine how the engine chooses which pages to show first in response to your
query. For now, we’re interested in examining which webpages can be found somewhere in the collection of
results.
One problem we haven’t yet considered in this context results from synonymy, the existence of multiple words
having the same (or almost the same) meaning. Look again at our original example with four webpages. If you
were to look for pages matching the query term found, you would have no matches. However, Pg 2 uses the
word located, which has essentially the same meaning, and that page might well be of interest to you when
you are searching for found. Even worse, Pg 4 does not use any such verb, but clearly the inclusion of the
phrase place in Canada implies that it describes the location of the Bruce Peninsula. A huge problem for search
engines is to find the pages that are relevant to your query when they do not include the term or terms you
enter into the search box. In fact, if you do not find the webpages you are expecting when you enter a query,
it may be beneficial to try alternative words that might appear on the pages you had hoped to see.
A related problem we noticed in the context of wiki articles was polysemy, the existence of multiple, often
unrelated meanings for a word. In our example, the word study appears on Pg 1 and on Pg 3, but these two
instances represent different meanings of the word (the first is a verb related to learn, whereas the second is a
noun synonymous with den). Consider a search for jaguar: are you looking for an animal or a car? Similarly,
when looking for bank, do you want webpages describing financial institutions, grounds at the edges of rivers,
rows of similar objects, underwater ridges, tilting airplanes, bouncing billiard balls, or something else entirely?
All these uses of bank are intermingled on its postings list and therefore indistinguishable when answering a
one-word query. The solution, therefore, is to elaborate your query by specifying other search words that help
to disambiguate among the various senses. For example, you might enter river bank, bank turn, or money bank
into the search box to make it clearer which webpages will be of interest to you. We’ll examine how search
engines process queries with more than one term in the next module.
Finally, there is the problem of variant spellings for a word. For example, when looking for webpages
mentioning colour, you probably also want to find those mentioning color; similarly, organize and
organise should both return the same pages when queried. A related problem arises from the inconsistent use of
hyphenation: webpages containing southwestern, south western, and south-western are probably all of
interest when any of these forms is sought. Dealing with these many forms of words might be addressed in the
same way as dealing with other word variants, such as plurals and past tenses, by augmenting the rules for
stemming to include other spelling variations. However, this won’t address the problem of spelling errors,
either in the query or in the text contents of webpages: it is much more difficult to match the erroneous
spelling souhwestern.
With all these difficulties, it is quite surprising that search engines are as useful as they are. Luckily we are not
confined to entering just a single query when looking for some particular webpages. In response to what is
returned, we can modify our search to retrieve additional pages or to eliminate superfluous pages from being
returned. One of the most powerful techniques for such query refinement is to remove results containing a
term by placing a hyphen (-) in front of it. For example, the query:
apple -computer
might help you locate pages related to apples rather than those related to the Apple Computer Corporation.
We’ll examine how this works in the next module.
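As a preview, and only as a guess at the simplest possible treatment, excluding a term can be thought of as removing one set of pages from another. The postings below are hypothetical and reduced to sets of URIs, and requiring all non-excluded terms to appear is our own assumption.

```python
# Hypothetical postings lists, reduced to sets of page URIs.
POSTINGS = {
    "apple": {"www.PgA", "www.PgB", "www.PgC"},
    "computer": {"www.PgB", "www.PgD"},
}

def refine(query):
    """Pages containing every plain term and none of the terms prefixed by '-'."""
    include, exclude = [], []
    for word in query.lower().split():
        if word.startswith("-") and len(word) > 1:
            exclude.append(word[1:])
        else:
            include.append(word)
    if not include:
        return set()
    # Intersect the pages of all included terms (a new set, so POSTINGS is unchanged).
    result = set.intersection(*(POSTINGS.get(term, set()) for term in include))
    for term in exclude:
        result -= POSTINGS.get(term, set())
    return result

print(sorted(refine("apple -computer")))   # ['www.PgA', 'www.PgC']
```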
8.3 Supporting Millions of Users
The following figure shows the basic components of a search engine and its connection to users and webpages
via the Internet:
A request originates from any of the users and travels via packets across the Internet to the search engine. The
engine’s front end processor examines the query, converts the search terms into index terms, and passes the
converted query to the lookup processor, which connects the index terms to the corresponding postings lists in
the index and combines them to produce matches. The best matches are passed back to the front end
processor, which formats a result page for the user. When the user receives the results page back from the
search engine, he or she may use it to access some of the webpages via the URIs returned, formulate another
query to pass to the search engine, or go on to do something completely different.
Even though computers are fast, a search engine can be quickly overwhelmed by hundreds of millions of
queries per day from users all over the Internet. Therefore, practical search engines do not have just one front
end processor and one lookup processor, but instead include hundreds or thousands of computers linked
together to form one seamless service.
Your search request can be handled by any one of the front end processors. Each processor can forward the
converted query to an available lookup processor, which consults its index. For example, the red lines indicate
one possible path that your request might follow. Each processor remembers where the request originated, so
that it can return its results back along the same path. As demand for service grows, more computers can be
added as front end processors or as lookup processors, and more copies of the index can be created as needed,
to provide fast responses to everyone’s queries. To handle larger amounts of data, each copy of the index can
be split across many computers.
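The ideas of copying the index and splitting each copy can be sketched as follows. This is a toy model under our own assumptions: the index is split by term using a checksum, requests rotate among copies in round-robin order, and all names and sizes are hypothetical. Real engines may divide the work quite differently.

```python
import itertools
import zlib

NUM_SHARDS = 3   # hypothetical number of computers holding one copy of the index

def shard_for(term):
    """Choose the shard for a term with a stable checksum (an assumption of this sketch)."""
    return zlib.crc32(term.encode("utf-8")) % NUM_SHARDS

def build_replica(index):
    """Split one copy of the index into NUM_SHARDS pieces, by term."""
    shards = [{} for _ in range(NUM_SHARDS)]
    for term, postings in index.items():
        shards[shard_for(term)][term] = postings
    return shards

# Hypothetical index, copied twice so that two lookups can proceed independently.
INDEX = {"waterloo": ["www.PgA", "www.PgB"], "ontario": ["www.PgB"]}
replicas = [build_replica(INDEX) for _ in range(2)]
next_replica = itertools.cycle(range(len(replicas)))

def lookup(term):
    """Forward the term to the next copy in turn; return (copy number, postings)."""
    replica_id = next(next_replica)
    shard = replicas[replica_id][shard_for(term)]
    return replica_id, shard.get(term, [])
```

Two successive calls to `lookup("waterloo")` return the same postings but are served by different copies, which is how adding copies spreads the load.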
Module 8 Summary
The indexing of webpages allows a search engine to answer user queries quickly and effectively. The
technology in current web search engines has evolved from traditional information retrieval systems, which
were first developed around 40 years ago, and which in turn evolved from conventional library indexing
systems.
Some Key Terms
case folding
front end processor
index
index term
inverted file
lookup processor
match
posting
postings list
query
search term
stemming
stop word
variant spelling
word offset
word segmentation
Skills
manually applying stop word lists against documents
manually creating postings lists from documents
manually tracing user queries through the steps applied by Web search engines to produce resulting
matches