View/Open - YorkSpace

advertisement
Behind the Scenes at
a Search Engine
William Denton <wdenton@yorku.ca>
Web Librarian, York University Libraries
20 March 2008
http://www.library.yorku.ca/binaries/frontiers/20080320-denton-search-engine.ppt
To be covered




The three basic parts of a web search
Search engine optimization
Advertising, and how to avoid it
Library databases and the deep web
Ask questions any time.
Denton: Search Engines / 20 March 2008 / York
2
Google, Yahoo, Ask, Live, A9
www.google.com: australopithecus
search.yahoo.com: australopithecus
www.ask.com: australopithecus
search.live.com: australopithecus
a9.com: australopithecus
Denton: Search Engines / 20 March 2008 / York
3
The computing power, bandwidth,
and electricity use is mind-blowing


David F. Carr, How
Google Works
Urs Hölze talk on
Google’s Linux cluster
(2002)
Ginger Strand, Keyword: Evil
(Harper’s, March 2008)
Denton: Search Engines / 20 March 2008 / York
4
What happens when you search?
You enter in some words
You get good links
And sometimes other good stuff
Denton: Search Engines / 20 March 2008 / York
5
Three things
1. How does it know about everything?
Denton: Search Engines / 20 March 2008 / York
6
Three things
2. How does it decide what’s relevant?
Denton: Search Engines / 20 March 2008 / York
7
Three things
3. How does it serve you the results?
Denton: Search Engines / 20 March 2008 / York
8
1. How does it know about
everything?
Crawlers are continually
Different search engines
moving around the
crawl different
web, looking for
numbers of pages, but
whatever they can find.
they all do in the
billions.
Denton: Search Engines / 20 March 2008 / York
9
A visit from a Googlebot
66.249.73.229 - [09/Mar/2008:04:15:47 -0400]
"GET /ccm/jsp/homepage.jsp HTTP/1.1"
200 13670 "-"
"Mozilla/5.0 (compatible; Googlebot/2.1;
http://www.google.com/bot.html) - -"
Denton: Search Engines / 20 March 2008 / York
10
robots.txt
Polite web crawlers first check the robots.txt
file on a web site and obey its rules.



Wikipedia’s robots.txt
York’s robots.txt
www.robotstxt.org
Denton: Search Engines / 20 March 2008 / York
11
Crawl, download, scan, repeat
Crawlers crawl and crawl and download all
the pages (and Word files and PDFs, and
spreadsheets) they come across. They
scan the page for links and add them to
their list of pages to crawl, ad infinitum.
Denton: Search Engines / 20 March 2008 / York
12
2. How does it decide what’s
relevant?







What’s on the page
Inbound and outbound links (PageRank)
Frequency of updates
The kind of site it is
Clickthroughs and usage analysis
People tweaking the rules
Secret other stuff
Denton: Search Engines / 20 March 2008 / York
13
Term frequency
How many times a word appears in a
document
Denton: Search Engines / 20 March 2008 / York
14
Inverse document frequency
Number of documents /
number of documents containing the term
(Actually the logarithm of this.)
Denton: Search Engines / 20 March 2008 / York
15
TF-IDF
TF-IDF of a keyword in a page = TF * IDF
Denton: Search Engines / 20 March 2008 / York
16
Example
100 web pages. Keyword: naillie
#1 has 8 mentions. TF = 8.
#2, 17, 19, 76 have 4 mentions. TF = 4.
20 pages have 1 mention. TF = 1.
IDF = log2 (100 / 25) = 2
TF-IDF of naillie in #1 = 8 * 2
= 16 High!
TF-IDF of naillie in #2, 17, 19, 76
=4*2
=8
Not so high
TF-IDF of naillie in 20 others
=1*2
=2
Small
TF-IDF of naillie in all the rest
=0*2
=0
Irrelevant
Denton: Search Engines / 20 March 2008 / York
17
HTML helps a lot
<title>The title of the page</title>
<h1>The most important heading</h1>
<h2>Lesser headings</h2>
<a href=“http://www.yorku.ca/”>Hyperlinks</a>
Text at the top of the page
Denton: Search Engines / 20 March 2008 / York
18
Semantic markup
Wrong: say how it
should look
<b><font size=“26”>
<i>Upon the Distinction
Between the Ashes of
the Various
Tobaccos</i>
</font></b>
Right: say what it is
(then apply a look)
<h1>Upon the
Distinction Between
the Ashes of the
Various
Tobaccos</h1>
Denton: Search Engines / 20 March 2008 / York
19
PageRank!




Algorithm designed by Larry Page and
Sergey Brin when they were at Stanford
One of the things Google uses in deciding
how important a web page is
US Patent 6,285,999
The Anatomy of a Large-Scale Hypertextual
Web Search Engine (Brin, Page, 1998)
Denton: Search Engines / 20 March 2008 / York
20
Denton: Search Engines / 20 March 2008 / York
21
http://en.wikipedia.org/wiki/PageRank
Wikipedia has a good explanation of PageRank
Denton: Search Engines / 20 March 2008 / York
22
Other weightings





Frequency of updates
The kind of site it is: blog? wiki? institutional
repository?
Who runs the site?
Search engine companies tweak the rules
They don’t give away their secrets, but lots
of people try to reverse engineer the
algorithms
Denton: Search Engines / 20 March 2008 / York
23
Google’s explanation
“Google combines PageRank with sophisticated
text-matching techniques to find pages that are
both important and relevant to your search.
Google goes far beyond the number of times a
term appears on a page and examines dozens of
aspects of the page's content (and the content of
the pages linking to it) to determine if it's a good
match for your query.”
http://www.google.com/technology/
Denton: Search Engines / 20 March 2008 / York
24
Live’s explanation
“Live Search website ranking is completely
automated. The Live Search ranking algorithm
analyzes factors such as web page content, the
number and quality of websites that link to your
pages, and the relevance of your website’s
content to keywords. The algorithm is complex
and is never human-mediated. You can't pay to
boost your website’s relevance ranking; however,
we do offer advertising options for website
owners.”
http://help.live.com/help.aspx?mkt=en-us&project=wl_webmasters
Denton: Search Engines / 20 March 2008 / York
25
What doesn’t it look at?



Crawlers don’t see inside fancy things like
Flash plug-ins, so if your home page is a
Flash intro, the search engines may go no
further
But they can infer what’s in an image
Most search engines ignore <meta> tags
and other metadata
Denton: Search Engines / 20 March 2008 / York
26
3. How does it serve you the results?
You enter terms,
it looks them up in a reverse index,
then it formats the results on the way out.
Denton: Search Engines / 20 March 2008 / York
27
Reverse index
naillie
1, 2, 17, 19, 76, etc.
partridge
1, 2, 35, 76, 8, 65
Not showing weights
and rankings etc.
Denton: Search Engines / 20 March 2008 / York
28
Formatting on the way out
Google: toronto shoes
Look at the
cached copy;
get the title,
keywords in
context,
etc.
Denton: Search Engines / 20 March 2008 / York
29
Search engine optimization
SEO is bringing in people by designing your
web site so that search engines will list you
high on results pages when people search
for certain keywords—without buying ads.
Denton: Search Engines / 20 March 2008 / York
30
White hat SEO
Improving your organic search results by
making your site understandable by search
engine crawlers, and by getting people to
link to you.
Denton: Search Engines / 20 March 2008 / York
31
White hat SEO
Have good content
Semantic markup again:


<title>Good Titles Matter</title>
<h1>So do headings</h1>
Use text
Host on a reliable, trustworthy site
Denton: Search Engines / 20 March 2008 / York
32
Black hat SEO
Pulling tricks to make search engines think
your site is more popular than it is or to
mislead them about the content.
Denton: Search Engines / 20 March 2008 / York
33
SEO browser extensions


Google Toolbar
Search Firefox Add-ons for “pagerank” and
“seo”, but mind the privacy issues on
what’s being reported where about the
pages you view
Denton: Search Engines / 20 March 2008 / York
34
Advertising
Search engines sell ads. You can pay to get
your web page listed at the top of the results
page for desired keywords.
Search engines are advertising companies.
Denton: Search Engines / 20 March 2008 / York
35
Google: AdWords and AdSense
AdWords is Google’s program for selling ads
on its site in results pages. Try their
Keyword Tool.
AdSense puts little boxes of context-relevant
ads on web pages of people who want to
make a little money. (Or a lot.)
Denton: Search Engines / 20 March 2008 / York
36
Yahoo


Video explaining how “sponsored search”
works
See how much it would cost to buy some
keywords
Denton: Search Engines / 20 March 2008 / York
37
Avoiding advertising
The Firefox extensions
Adblock and Customize
Google will make your
web browsing ad-free
and much more
pleasant.
Denton: Search Engines / 20 March 2008 / York
38
Denton: Search Engines / 20 March 2008 / York
39
Denton: Search Engines / 20 March 2008 / York
40
Invisible web or deep web
Search engines miss a lot of web content:
 It’s behind a login
 It’s dynamically generated
 It’s embedded in Flash or Java applets
Denton: Search Engines / 20 March 2008 / York
41
2/3 of the web goes unseen
He, Patel, Zhang, Chang, Accessing the Deep Web: A Survey
(Communications of the ACM, 50: 5, May 2007)
Denton: Search Engines / 20 March 2008 / York
42
Library databases
Library databases (JSTOR, PsycInfo, Scholars
Portal) are part of the deep web. They have
enormous amounts of information that’s
hidden from the public … except when it’s
not, as through Google Scholar.
Denton: Search Engines / 20 March 2008 / York
43
Library databases
Differences in rankings, algorithms
Full use of metadata
Usually sorted by date
PageRank is based on citation analysis, but
these databases don’t use citation analysis
to rank relevant papers
Scholars Portal: search Natural Sciences for
australopithecus
Denton: Search Engines / 20 March 2008 / York
44
Final note on privacy
Everything you do at a search engine is
logged. They track you by IP address and
with cookies and logins.
Assume all that information is stored forever.
See http://blog.searchenginewatch.com/blog/060206-150030
Denton: Search Engines / 20 March 2008 / York
45
Further reading: online





searchenginewatch.com
John Battelle’s Searchblog
Online: Exploring Technology and
Resources for Information Professionals
Wikipedia (usually quite strong on technical
articles)
The library’s Computer Science Research
Guide
Denton: Search Engines / 20 March 2008 / York
46
Further reading: books



Battelle, John. 2006. The Search: The Inside Story
of How Google and Its Rivals Changed Everything
(Portfolio)
Berry, Michael W., and Murray Browne. 2005.
Understanding Search Engines: Mathematical
Modelling and Text Retrieval (SIAM)
Levene, Mark. 2006. An Introduction to Search
Engines and Web Navigation (Addison-Wesley)
Denton: Search Engines / 20 March 2008 / York
47
Denton: Search Engines / 20 March 2008 / York
48
Download