Computer Science 1000
Information Searching II
Redistribution of these slides without permission is strictly prohibited

Search Engine
a collection of computer programs designed
to help us find information on the Web
 typically served through a website
 different search providers exist, but basic
functionality is consistent

type keywords into a text box
 page returns links to other pages


Search Engine

why is a search engine like an index?
recall that an index maps keywords to a location
in some medium (like a page number in a book)
 a search engine does a very similar thing




takes keywords of interest from a user
maps these keywords to relevant web pages
in fact, one of the key components of a search
engine is its index

Search Engine

what differentiates a search engine from
other indexes (like a book index)?

the ability to quickly combine keywords in
searches

e.g. search for information on ducks and foxes
result ranking
 personalization
 among others …


Search Engine – How it Works
different search engines employ different
technologies
 the full details of commercial search
engines are typically not public
 however, some of the basics are consistent

crawling
 indexing
 query processing


Crawling


for a search engine to be able to link to a web
page, it must know about its existence
search engines find pages by crawling the web




programs called crawlers or spiders
e.g. Googlebot
a crawler visits web pages, in much the same way
that you do
as each page is visited, information is remembered
about the page (indexing)

Crawling – Todo List



the todo list is a list of pages that
are to be visited by the crawler
the crawling process starts with
an initial to-do list, populated with
sites from previous crawls
however, the list is updated as the
crawl takes place

hyperlinks on visited sites are added
to the list
Todo List
http://www.uleth.ca
http://www.tsn.ca
http://www.usask.ca
...

Crawling – Example

suppose that this page was being
processed by a crawler

Kev's Page
Favorite Stuff:
• New York Islanders
• Saskatchewan Roughriders
• John Deere

as a consequence of this page
being crawled, its links would be
added to the todo list (if they aren't
already there)
those pages would subsequently
be checked by the crawler at some
point
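
The todo-list process described above can be sketched as a small simulation. This is only an illustration, not how any real crawler is implemented: the `LINKS` dictionary is a hypothetical stand-in for the web (each page name mapped to the pages it links to), and no real pages are downloaded.

```python
from collections import deque

# Hypothetical link structure: each page maps to the pages it links to.
LINKS = {
    "Page 1": ["Page 2"],
    "Page 2": ["Page 4"],
    "Page 3": ["Page 6"],
    "Page 4": [],
    "Page 5": ["Page 6"],
    "Page 6": [],
}

def crawl(initial_todo, links):
    """Visit every page reachable from the initial todo list."""
    todo = deque(initial_todo)
    seen = set(initial_todo)      # pages that have ever been on the todo list
    visited = []
    while todo:
        page = todo.popleft()
        visited.append(page)      # a real crawler would index the page here
        for target in links.get(page, []):
            if target not in seen:    # add links only if they aren't already there
                seen.add(target)
                todo.append(target)
    return visited

visited = crawl(["Page 1", "Page 2", "Page 3"], LINKS)
print(visited)
```

In this made-up example, Page 4 and Page 6 are picked up through links, while Page 5 is never visited: it is not on the initial list and no visited page links to it.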

The "Invisible Web"

not all information is crawled, which means
it is not visible to search engines
some pages are new, and haven't yet had a
chance to be crawled
 however, there are other reasons that certain
information does not get crawled


The "Invisible Web"

1) No hyperlinks to that page

recall that in order for a page to be crawled, it must either:

be on the todo list
be linked to from a page that appears on the todo list

without a hyperlink, that page will never be found

Todo List
Page 1
Page 2
Page 3

[Figure: Web pages (Page 1 to Page 6) and the hyperlinks between them]
Page 5 will not be crawled, as it is not on the to-do list, and
no other pages link to it.

The "Invisible Web"

2) The Page is synthetic


a synthetic page is created on demand, depending on
user input
e.g. the results of a search on another search engine
My personal search for "New
York Islanders" on Bing results
in an on-demand page that is
not stored. Hence, it will not be
crawled.

The "Invisible Web"

3) The content is unreadable to the crawler


search engines are primarily text-based
certain data, such as movie content, is not crawlable
The webpage containing the
movie might be crawled, but
not the movie itself.
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=72746

The "Invisible Web"

4) The content is password-protected

if you require a password to access a page, then so does
a search engine*

The "Invisible Web"

5) You ask the search engine to
ignore your site

the presence of certain files stored
with your website will restrict your site
from being crawled



e.g. The Robots Exclusion Protocol
a file called robots.txt can be stored that
will request that your site (or just certain
pages) not be crawled
unlike the previous four examples,
this does not prevent search engines
from crawling your site:
they can choose to ignore robots.txt

Example:
User-agent: Google
Disallow:
User-agent: *
Disallow: /
http://www.robotstxt.org/
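
To see how a well-behaved crawler might interpret the example file, here is a minimal sketch using `urllib.robotparser` from Python's standard library. The crawler name `SomeOtherBot` and the `www.example.com` URL are hypothetical, chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from the slide.
ROBOTS_TXT = """\
User-agent: Google
Disallow:
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "http://www.example.com/index.html"   # hypothetical page on the site

# "Google" has an empty Disallow line, so nothing is off limits to it.
google_allowed = parser.can_fetch("Google", url)
# Every other crawler falls under "*", which disallows the whole site.
other_allowed = parser.can_fetch("SomeOtherBot", url)

print(google_allowed, other_allowed)
```

The parser only reports what the file requests. Nothing in the file itself stops a crawler that decides not to check it.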

Indexing


the primary role of the crawler is to build an index
an index is a list of tokens

words
phrases (not considered here)*

each token is associated with a list of URLs

in other words, like a book index, but with page URLs instead of
page numbers
other information might be stored with URLs (e.g. page location
of token)

these indexes are saved by the search provider

search queries use information from the indexes (fast), rather
than crawling the web for each query (slow)
*http://www.google.com/patents/US7536408
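
A token-to-URL index of this kind can be sketched with a dictionary. The page texts below are invented for illustration (only the URLs appear elsewhere in these slides), and real indexes store far more than this.

```python
from collections import defaultdict

# Hypothetical crawled pages: URL mapped to the text found on the page.
PAGES = {
    "www.fish.com": "fish for sale",
    "www.red.com": "red paint for sale",
    "www.fullredguppy.com": "red fish guppy",
}

def build_index(pages):
    """Map each word token to an alphabetized list of URLs containing it."""
    index = defaultdict(list)
    for url in sorted(pages):
        # set() so a word repeated on one page adds the URL only once
        for token in set(pages[url].lower().split()):
            index[token].append(url)
    return dict(index)

index = build_index(PAGES)
print(index["red"])
print(index["fish"])
```

Answering a query then becomes a dictionary lookup rather than a fresh crawl of the web.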

Index Lists – Example
* from text – Figure number might be different

Indexing – What Makes a Token?

page text


a common approach
search providers differ on which text is selected*


some may use all text
others may only use certain text, such as:





titles and headings
frequently occurring words
words occurring early in a page
sometimes, stop words (a, an, the) are ignored
hyperlink text

the term from a hyperlink on another page may be used to
describe the page that it links to
*http://computer.howstuffworks.com/internet/basics/search-engine1.htm
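
As a rough illustration of token selection, the sketch below splits a piece of text (for example, a page title) into word tokens and ignores the stop words listed above. The sample title and the exact splitting rule are assumptions. Actual providers make their own choices.

```python
import re

STOP_WORDS = {"a", "an", "the"}   # the stop words named above

def tokens(text):
    """Split text into lowercase word tokens, dropping stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [word for word in words if word not in STOP_WORDS]

# Hypothetical page title
title_tokens = tokens("The Guppy: A Red Fish")
print(title_tokens)
```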

Query Processing


the part of the search engine that we see
the query processor:

reads words/phrases from the user interface
returns pages that are relevant to that query

modern query processors:

are extremely fast
are very accurate
allow a considerable variety in their capabilities

how does this all work?

Query Processing – How it works
let's start simple: suppose we search for a
single word (e.g. cat)
 in a nutshell:


the search engine finds the list for the token 'cat'

contains list of pages that contain 'cat' in the appropriate text
(e.g. title)
this list is ranked according to perceived
relevance
 the ranked list is returned as an ordered set of
hyperlinks


Query Processing – How it works

Step 1: the search engine finds the list for
the token 'cat'

Query Processing – How it works

Step 2: this list is ranked according to
perceived relevance
www.cat.com
en.wikipedia.org/wiki/Cat
www.youtube.com/watch?v=J---aiyznGQ
...

Query Processing – How it works

Step 3: the ranked list is returned as an
ordered set of hyperlinks
www.cat.com
en.wikipedia.org/wiki/Cat
www.youtube.com/watch?v=J---aiyznGQ
...
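
The three steps can be sketched as follows. The index entry and the relevance scores are made up for illustration. How a real search engine computes relevance is not shown here.

```python
# Hypothetical index entry and relevance scores (higher means more relevant).
INDEX = {
    "cat": [
        "en.wikipedia.org/wiki/Cat",
        "www.youtube.com/watch?v=J---aiyznGQ",
        "www.cat.com",
    ],
}
SCORES = {
    "www.cat.com": 0.9,
    "en.wikipedia.org/wiki/Cat": 0.8,
    "www.youtube.com/watch?v=J---aiyznGQ": 0.5,
}

def search(word, index, scores):
    # Step 1: find the list for the token
    urls = index.get(word.lower(), [])
    # Step 2: rank the list by perceived relevance
    ranked = sorted(urls, key=lambda url: scores.get(url, 0.0), reverse=True)
    # Step 3: return the ranked list as an ordered set of hyperlinks
    return ["http://" + url for url in ranked]

results = search("cat", INDEX, SCORES)
for link in results:
    print(link)
```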

Query Processing

what about multi-word searching?
as mentioned, some search engines index
phrases as well
 however, what if a particular phrase is not
indexed?



e.g. (text) red fish guppy
solution: intersecting queries

the webpages that are common to all of the search words
are returned

Intersecting Queries


example (text): suppose the query was “red fish guppy”
further suppose that the indexes for each word were as
follows:


result is the set of sites that contain all of the keywords
in other words, the sites that are found on all three lists
red:
en.wikipedia.org/wiki/red
newsroom.urc.edu
www.fullredguppy.com
www.red.com
www.sciencedaily.com
fish:
en.wikipedia.org/wiki/fish
newsroom.urc.edu
www.fish.com
www.fullredguppy.com
www.sciencedaily.com
guppy:
en.wikipedia.org/wiki/guppy
www.ifga.org
www.fullredguppy.com
www.sciencedaily.com
www.tropicalfish.com
Result:
www.fullredguppy.com
www.sciencedaily.com
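
The same example can be reproduced with Python's built-in sets. This is a convenience sketch only: it shows what an intersecting query returns, not how a search engine computes it at scale.

```python
# The three index lists from the example, stored as sets of URLs.
INDEX = {
    "red": {
        "en.wikipedia.org/wiki/red", "newsroom.urc.edu",
        "www.fullredguppy.com", "www.red.com", "www.sciencedaily.com",
    },
    "fish": {
        "en.wikipedia.org/wiki/fish", "newsroom.urc.edu", "www.fish.com",
        "www.fullredguppy.com", "www.sciencedaily.com",
    },
    "guppy": {
        "en.wikipedia.org/wiki/guppy", "www.ifga.org",
        "www.fullredguppy.com", "www.sciencedaily.com", "www.tropicalfish.com",
    },
}

def intersect_query(query, index):
    """Return the URLs that appear on the list of every word in the query."""
    lists = [index.get(word, set()) for word in query.lower().split()]
    if not lists:
        return []
    return sorted(set.intersection(*lists))

result = intersect_query("red fish guppy", INDEX)
print(result)
```

Note that newsroom.urc.edu is on the red and fish lists but not the guppy list, so it is left out.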

Intersecting Queries - Efficiency

the size of index lists can be large

'cat' returns over 2.3 billion results
modern search engines are fast
 hence, clever algorithms must be developed
for optimizing queries
 example: intersecting queries


Intersecting Queries - Efficiency

suppose you had two search terms

e.g. red and fish

the query processor has a list for each token
suppose each list contained 1 billion URLs
let's consider a method for performing the
intersecting query

that is, how do we find all pages that occur on both lists?

The Naive Approach

for each entry in the 'red' list
search through the entire 'fish' list
 if we find the entry from the red list, then add
that to our result

red:
www.sciencedaily.com
en.wikipedia.org/wiki/red
newsroom.urc.edu
www.red.com
www.fullredguppy.com
fish:
en.wikipedia.org/wiki/fish
newsroom.urc.edu
www.fish.com
www.fullredguppy.com
www.sciencedaily.com
result:

The Naive Approach
First search: www.sciencedaily.com
 do we find it in second list?


yes – add it to result
red:
www.sciencedaily.com
en.wikipedia.org/wiki/red
newsroom.urc.edu
www.red.com
www.fullredguppy.com
fish:
en.wikipedia.org/wiki/fish
newsroom.urc.edu
www.fish.com
www.fullredguppy.com
www.sciencedaily.com
result:
www.sciencedaily.com

The Naive Approach
Second search: en.wikipedia.org/wiki/red
 do we find it in second list?


no
red:
www.sciencedaily.com
en.wikipedia.org/wiki/red
newsroom.urc.edu
www.red.com
www.fullredguppy.com
fish:
en.wikipedia.org/wiki/fish
newsroom.urc.edu
www.fish.com
www.fullredguppy.com
www.sciencedaily.com
result:
www.sciencedaily.com

The Naive Approach
Third search: newsroom.urc.edu
 do we find it in second list?


yes, add it to result
red:
www.sciencedaily.com
en.wikipedia.org/wiki/red
newsroom.urc.edu
www.red.com
www.fullredguppy.com
fish:
en.wikipedia.org/wiki/fish
newsroom.urc.edu
www.fish.com
www.fullredguppy.com
www.sciencedaily.com
result:
www.sciencedaily.com
newsroom.urc.edu

The Naive Approach
Fourth search: www.red.com
 do we find it in second list?


no
red:
www.sciencedaily.com
en.wikipedia.org/wiki/red
newsroom.urc.edu
www.red.com
www.fullredguppy.com
fish:
en.wikipedia.org/wiki/fish
newsroom.urc.edu
www.fish.com
www.fullredguppy.com
www.sciencedaily.com
result:
www.sciencedaily.com
newsroom.urc.edu

The Naive Approach
Fifth search: www.fullredguppy.com
 do we find it in second list?


yes – add it to result
red:
www.sciencedaily.com
en.wikipedia.org/wiki/red
newsroom.urc.edu
www.red.com
www.fullredguppy.com
fish:
en.wikipedia.org/wiki/fish
newsroom.urc.edu
www.fish.com
www.fullredguppy.com
www.sciencedaily.com
result:
www.sciencedaily.com
newsroom.urc.edu
www.fullredguppy.com

The Naive Approach

problems?
slow!!
 for each URL in left list, we potentially had to
compare it to every URL in right list
 under our previous assumption (billion size lists),
we have to do 1 billion x 1 billion comparisons
 even for a powerful computer, this would require
a considerable amount of time
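
The cost of the naive approach can be seen by counting comparisons on the five-entry lists from the walkthrough. The sketch below is a straightforward rendering of the steps described, with a counter added for illustration.

```python
RED = [
    "www.sciencedaily.com", "en.wikipedia.org/wiki/red", "newsroom.urc.edu",
    "www.red.com", "www.fullredguppy.com",
]
FISH = [
    "en.wikipedia.org/wiki/fish", "newsroom.urc.edu", "www.fish.com",
    "www.fullredguppy.com", "www.sciencedaily.com",
]

def naive_intersection(left, right):
    """For each entry in the left list, search through the right list."""
    result = []
    comparisons = 0
    for left_url in left:
        for right_url in right:
            comparisons += 1
            if left_url == right_url:
                result.append(left_url)
                break          # found it, no need to look further
    return result, comparisons

result, comparisons = naive_intersection(RED, FISH)
print(result)
print(comparisons)
```

Even on lists of five entries this takes 21 comparisons. An entry that is not in the right list always costs a full pass through it.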


Alphabetized Lists


suppose that each list was maintained
alphabetically
then we could employ the following approach


place a marker at start of each list
if markers point to same URL:




add URL to result list
move both markers down
otherwise, move the marker whose URL is
lexicographically smaller
stop when at least one marker goes off the end of the list
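
The marker procedure above can be sketched as follows, assuming both lists are already alphabetized. The markers are plain list positions (`i` and `j`), and the step counter is added only for illustration.

```python
RED = [
    "en.wikipedia.org/wiki/red", "newsroom.urc.edu", "www.fullredguppy.com",
    "www.red.com", "www.sciencedaily.com",
]
FISH = [
    "en.wikipedia.org/wiki/fish", "newsroom.urc.edu", "www.fish.com",
    "www.fullredguppy.com", "www.sciencedaily.com",
]

def sorted_intersection(left, right):
    """Intersect two alphabetized lists with one marker per list."""
    result = []
    i = j = 0                  # place a marker at the start of each list
    steps = 0
    while i < len(left) and j < len(right):   # stop when a marker goes off the end
        steps += 1
        if left[i] == right[j]:
            result.append(left[i])            # same URL: add it, move both markers
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1                            # left URL is smaller: move left marker
        else:
            j += 1                            # right URL is smaller: move right marker
    return result, steps

result, steps = sorted_intersection(RED, FISH)
print(result)
print(steps)
```

Python compares strings lexicographically, so `<` matches the alphabetical ordering assumed here.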

The Sorted Approach

place markers at the start of each list
red:
en.wikipedia.org/wiki/red
newsroom.urc.edu
www.fullredguppy.com
www.red.com
www.sciencedaily.com
fish:
en.wikipedia.org/wiki/fish
newsroom.urc.edu
www.fish.com
www.fullredguppy.com
www.sciencedaily.com
result:

The Sorted Approach

do markers point to same URL?
no
 since right marker's URL is less than left
marker's URL, move right marker down

red:
en.wikipedia.org/wiki/red
newsroom.urc.edu
www.fullredguppy.com
www.red.com
www.sciencedaily.com
fish:
en.wikipedia.org/wiki/fish
newsroom.urc.edu
www.fish.com
www.fullredguppy.com
www.sciencedaily.com
result:

The Sorted Approach

do markers point to same URL?
no
 since left marker's URL is less than right
marker's URL, move left marker down

red:
en.wikipedia.org/wiki/red
newsroom.urc.edu
www.fullredguppy.com
www.red.com
www.sciencedaily.com
fish:
en.wikipedia.org/wiki/fish
newsroom.urc.edu
www.fish.com
www.fullredguppy.com
www.sciencedaily.com
result:

The Sorted Approach

do markers point to same URL?

yes


add URL to result
move both markers
red:
en.wikipedia.org/wiki/red
newsroom.urc.edu
www.fullredguppy.com
www.red.com
www.sciencedaily.com
fish:
en.wikipedia.org/wiki/fish
newsroom.urc.edu
www.fish.com
www.fullredguppy.com
www.sciencedaily.com
result:
newsroom.urc.edu

The Sorted Approach

do markers point to same URL?
no
 since right marker's URL is less than left
marker's URL, move right marker down

red:
en.wikipedia.org/wiki/red
newsroom.urc.edu
www.fullredguppy.com
www.red.com
www.sciencedaily.com
fish:
en.wikipedia.org/wiki/fish
newsroom.urc.edu
www.fish.com
www.fullredguppy.com
www.sciencedaily.com
result:
newsroom.urc.edu

The Sorted Approach

do markers point to same URL?

yes


add URL to result
move both markers
red:
en.wikipedia.org/wiki/red
newsroom.urc.edu
www.fullredguppy.com
www.red.com
www.sciencedaily.com
fish:
en.wikipedia.org/wiki/fish
newsroom.urc.edu
www.fish.com
www.fullredguppy.com
www.sciencedaily.com
result:
newsroom.urc.edu
www.fullredguppy.com

The Sorted Approach

do markers point to same URL?
no
 since left marker's URL is less than right
marker's URL, move left marker down

red:
en.wikipedia.org/wiki/red
newsroom.urc.edu
www.fullredguppy.com
www.red.com
www.sciencedaily.com
fish:
en.wikipedia.org/wiki/fish
newsroom.urc.edu
www.fish.com
www.fullredguppy.com
www.sciencedaily.com
result:
newsroom.urc.edu
www.fullredguppy.com

The Sorted Approach

do markers point to same URL?

yes


add URL to result
move both markers
red:
en.wikipedia.org/wiki/red
newsroom.urc.edu
www.fullredguppy.com
www.red.com
www.sciencedaily.com
fish:
en.wikipedia.org/wiki/fish
newsroom.urc.edu
www.fish.com
www.fullredguppy.com
www.sciencedaily.com
result:
newsroom.urc.edu
www.fullredguppy.com
www.sciencedaily.com

The Sorted Approach
at least one marker has completed its list,
so we can stop
 notice that our result contains correct values

red:
en.wikipedia.org/wiki/red
newsroom.urc.edu
www.fullredguppy.com
www.red.com
www.sciencedaily.com
fish:
en.wikipedia.org/wiki/fish
newsroom.urc.edu
www.fish.com
www.fullredguppy.com
www.sciencedaily.com
result:
newsroom.urc.edu
www.fullredguppy.com
www.sciencedaily.com

The Sorted Approach

how many comparisons are done?
note that every step involves moving at least one
marker
 hence, the maximum number of steps is 2 billion
 this is considerably less than (1 billion) squared
 result: a massive speedup
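
To put rough numbers on the speedup, the sketch below works through the arithmetic. The rate of one billion comparisons per second is purely an assumption for illustration. Real hardware and real index sizes will differ.

```python
LIST_SIZE = 1_000_000_000               # entries per list, as assumed earlier
COMPARISONS_PER_SECOND = 1_000_000_000  # assumption, for illustration only

naive_comparisons = LIST_SIZE * LIST_SIZE   # every URL against every URL
sorted_steps = LIST_SIZE + LIST_SIZE        # each step moves at least one marker

naive_seconds = naive_comparisons / COMPARISONS_PER_SECOND
sorted_seconds = sorted_steps / COMPARISONS_PER_SECOND
naive_years = naive_seconds / (60 * 60 * 24 * 365)
speedup = naive_comparisons // sorted_steps

print(f"naive:  about {naive_years:.1f} years")
print(f"sorted: about {sorted_seconds:.0f} seconds")
print(f"speedup factor: {speedup:,}")
```

Under these assumptions the naive approach takes decades in the worst case, while the sorted approach takes a couple of seconds.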


The Sorted Approach – Notes

remember: commercial search engines don't fully
publicize strategies


hence, some search engines may use alternate
approaches for efficient intersections
the previous strategy applies to more than two lists
simultaneously

hence, we can search for multiple tokens, rather than just
two
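
One possible way to extend the marker idea to several lists at once is sketched below. It is an illustration under the assumption that every list is alphabetized, not a description of what any particular search engine does.

```python
RED = [
    "en.wikipedia.org/wiki/red", "newsroom.urc.edu", "www.fullredguppy.com",
    "www.red.com", "www.sciencedaily.com",
]
FISH = [
    "en.wikipedia.org/wiki/fish", "newsroom.urc.edu", "www.fish.com",
    "www.fullredguppy.com", "www.sciencedaily.com",
]
GUPPY = [
    "en.wikipedia.org/wiki/guppy", "www.fullredguppy.com", "www.ifga.org",
    "www.sciencedaily.com", "www.tropicalfish.com",
]

def multi_intersection(lists):
    """Intersect any number of alphabetized lists, one marker per list."""
    if not lists:
        return []
    markers = [0] * len(lists)
    result = []
    # stop when at least one marker goes off the end of its list
    while all(m < len(lst) for m, lst in zip(markers, lists)):
        current = [lst[m] for m, lst in zip(markers, lists)]
        largest = max(current)
        if all(url == largest for url in current):
            result.append(largest)              # all markers agree: keep the URL
            markers = [m + 1 for m in markers]
        else:
            # a URL smaller than the largest cannot match yet, so move past it
            markers = [m + 1 if url < largest else m
                       for m, url in zip(markers, current)]
    return result

result = multi_intersection([RED, FISH, GUPPY])
print(result)
```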

Example
(from text):

Ranking Results


a typical search can produce
millions of results
however, we often find what we
are looking for in the first few
results



according to Optify, the first result
returned by Google gets clicked
36.4% of the time
the first page gets clicked through
90% of the time
how does this occur?

via a page ranking system
http://searchenginewatch.com/article/2049695/Top-Google-Result-Gets-36.4-of-Clicks-Study

Ranking Results

search providers have different ways
of ranking the results of the search

Google: PageRank



proprietary (not all details available)
some details are public (considered next)
the higher the PageRank score, the closer to
the top of the search results a page will be
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=70897

PageRank
a scoring system
 links from other pages add to a page's
score

Web pages
Page 1 links to: Page 4, Page 5
Page 2 links to: Page 5, Page 6
Page 3 links to: Page 5, Page 6

the link from Page 1 adds
to Page 4's score
the links from Pages 1, 2, 3
add to Page 5's score
the links from Pages 2 and 3
add to Page 6's score

PageRank


the score from each page is not weighted equally
the higher a page's PageRank, the more important its
contribution is
Web pages
Page 1 (Low Rank) links to: Page 3
Page 2 (High Rank) links to: Page 4

suppose that Page 3 has one incoming link (from Page 1),
and Page 4 has one incoming link (from Page 2)
since Page 2's rank is higher than Page 1's,
then Page 4's rank will be higher than Page 3's
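
Both ideas (links add to a page's score, and links from higher-ranked pages count for more) appear in the publicly described iterative form of PageRank. The sketch below is a simplified illustration on a hypothetical six-page link structure. The damping value of 0.85, the number of iterations, and the handling of pages with no outgoing links are assumptions, and this is not a description of Google's production ranking.

```python
# Hypothetical link structure: each page maps to the pages it links to.
LINKS = {
    "Page 1": ["Page 4", "Page 5"],
    "Page 2": ["Page 5", "Page 6"],
    "Page 3": ["Page 5", "Page 6"],
    "Page 4": [],
    "Page 5": [],
    "Page 6": [],
}

def pagerank(links, damping=0.85, iterations=50):
    """Repeatedly pass each page's score along its outgoing links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}      # start with equal scores
    for _ in range(iterations):
        # score held by pages with no outgoing links is shared among all pages
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for source, targets in links.items():
            for target in targets:
                # a page splits its own score evenly among the pages it links to
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank

rank = pagerank(LINKS)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))
```

In this example Page 5, with three incoming links, ends up with the highest score, followed by Page 6 and then Page 4.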

PageRank – Notes


since a page is not necessarily aware of other
pages that point to it, its PageRank must be
computed by the crawler
PageRank is only part of the ranking process that
you see



Google uses over 200 factors to determine page relevancy
PageRank is one of those factors
others include location, language, personalization, etc.
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=70897