i206: Lecture 22:
Web Search Engines; Recap
Distributed Systems
Marti Hearst
Spring 2012
How Search Engines Work
• There are MANY issues
• I’m only giving the basics
How Search Engines Work
Three main parts:
i. Gather the contents of all web pages (using a program called a crawler or spider)
ii. Organize the contents of the pages in a way that allows efficient retrieval (indexing)
iii. Take in a query, determine which pages match, and show the results (ranking and display of results)
Slide adapted from Lew & Davis
Standard Web Search Engine Architecture
[Diagram: crawler machines crawl the web; the documents are checked for duplicates and stored (with DocIds); from them an inverted index is created and served from the search engine servers.]
Standard Web Search Engine Architecture
[The same diagram, now with the query path: a user query goes to the search engine servers, which consult the inverted index and show results to the user.]
Search engine architecture, from “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, Brin & Page, 1998.
http://dbpubs.stanford.edu:8090/pub/1998-8
Spiders or crawlers
• How to find web pages to visit and copy?
– Can start with a list of domain names and visit the home pages there.
– Look at the hyperlinks on the home page, and follow those links to more pages.
• Use HTTP commands to GET the pages (a minimal crawling loop is sketched below)
– Keep a list of URLs visited, and those still to be visited.
– Each time the program loads a new HTML page, add the links in that page to the list to be crawled.
Slide adapted from Lew & Davis
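A minimal, hypothetical sketch of this loop in Python (the seed list, page limit, and link handling are illustrative; a real crawler must also obey robots.txt, rate limits, and much more):

    # Breadth-first crawl: fetch pages, collect links, repeat (sketch).
    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":                         # collect hyperlink targets
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed_urls, max_pages=100):
        to_visit = list(seed_urls)                 # URLs still to be visited
        visited = set()                            # URLs already visited
        while to_visit and len(visited) < max_pages:
            url = to_visit.pop(0)
            if url in visited:                     # avoid re-fetching (and cycles)
                continue
            try:
                page = urllib.request.urlopen(url, timeout=10)
                html = page.read().decode("utf-8", "replace")
            except (OSError, ValueError):          # servers are often down or slow
                continue
            visited.add(url)
            parser = LinkParser()
            parser.feed(html)
            for link in parser.links:              # add this page's links to the list
                to_visit.append(urljoin(url, link))
        return visited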
Spider behavior varies
• Parts of a web page that are indexed
• How deeply a site is indexed
• Types of files indexed
• How frequently the site is spidered
Slide adapted from Lew & Davis
Four Laws of Crawling
• A Crawler must show identification
• A Crawler must obey the robots exclusion standard
http://www.robotstxt.org/wc/norobots.html
• A Crawler must not hog resources
• A Crawler must report errors
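Python’s standard library includes a parser for the robots exclusion standard, which a well-behaved crawler consults before fetching (a small sketch; the site and the user-agent name are illustrative):

    # Obey robots.txt before fetching a page (sketch).
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")   # illustrative site
    rp.read()
    # The crawler identifies itself via its user-agent string.
    if rp.can_fetch("MyCrawler/1.0", "http://www.example.com/some/page.html"):
        pass  # allowed: safe to GET this page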
The Internet Is Enormous
Image from http://www.nature.com/nature/webmatters/tomog/tomfigs/fig1.html
Lots of tricky aspects
• Servers are often down or slow
• Hyperlinks can get the crawler into cycles
• Some websites have junk in the web pages
• Now many pages have dynamic content
– Javascript
– The “hidden” web
– E.g., schedule.berkeley.edu
• You don’t see the course schedules until you run a query.
• The web is HUGE
“Freshness”
• Need to keep checking pages
– Pages change
• At different frequencies
• Pages are removed
– Many search engines cache the pages (store a copy on their own servers) to save time/effort
• But pages that change a lot stymie this strategy
What really gets crawled?
• A small fraction of the Web that search engines know about; no search engine is exhaustive
• Not the “live” Web, but the search engine’s index
• Not the “Deep Web”
• Mostly HTML pages but other file types too: PDF, Word, PPT, etc.
Slide adapted from Lew & Davis
ii. Index (the database)
Record information about each page
• List of words
– In the title?
– How far down in the page?
– Was the word in boldface?
• URLs of pages pointing to this one
• Anchor text on pages pointing to this one
Slide adapted from Lew & Davis
The importance of anchor text
<a href="http://courses.ischool…"> i141 </a>
<a href="http://courses.ischool…"> A terrific course on search engines </a>
The anchor text summarizes what the website is about.
Inverted Index
• How to store the words for fast lookup
• Basic steps:
– Make a “dictionary” of all the words in all of the web pages
– For each word, list all the documents it occurs in.
– Often omit very common words
• “stop words”
– Sometimes stem the words
• (also called morphological analysis)
• cats -> cat
• running -> run
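A minimal sketch of these steps in Python (the stop-word list and the crude suffix-stripping “stemmer” are illustrative stand-ins for real components):

    # Build a tiny inverted index: word -> set of document ids (sketch).
    STOP_WORDS = {"the", "a", "of", "and", "in"}    # illustrative stop words

    def stem(word):
        # Crude stand-in for real morphological analysis.
        for suffix in ("ning", "ing", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    def build_index(docs):
        # docs: dict mapping doc_id -> page text
        index = {}
        for doc_id, text in docs.items():
            for word in text.lower().split():
                if word in STOP_WORDS:
                    continue                        # omit very common words
                index.setdefault(stem(word), set()).add(doc_id)
        return index

    index = build_index({1: "cats running", 2: "a cat sat"})
    # index["cat"] == {1, 2}; index["run"] == {1}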
Inverted Index Example
Image from http://developer.apple.com/documentation/UserExperience/Conceptual/SearchKitConcepts/searchKit_basics/chapter_2_section_2.html
Inverted Index
• In reality, this index is HUGE
• Need to store the contents across many machines
• Need to do optimization tricks to make lookup fast.
Query Serving Architecture
[Diagram: a “travel” query flows through a load balancer to one of several front ends (FE1…FE8), then to a query integrator (QI1…QI8), which fans it out to a grid of index-serving nodes (Node1,1 … Node4,N).]
• Index divided into segments, each served by a node
• Each row of nodes replicated for query load
• Query integrator distributes query and merges results (sketched below)
• Front end creates an HTML page with the query results
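A toy sketch of that scatter-gather step in Python (the segment objects and their search method are hypothetical stand-ins; real systems add replication, timeouts, and result caching):

    # Scatter a query to all index segments, then merge the global top-k (sketch).
    import heapq

    def query_integrator(query, segments, k=10):
        # segments: objects with a hypothetical search(query) method that
        # returns (score, doc_id) pairs from one index segment.
        candidates = []
        for segment in segments:
            candidates.extend(segment.search(query))   # scatter
        return heapq.nlargest(k, candidates)           # gather: global top-k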
Results ranking
• Search engine receives a query, then
• Looks up the words in the index, retrieves many documents, then
• Rank-orders the pages and extracts “snippets” or summaries containing query words.
– In the early days: statistical
– Next: implicit AND
– Now: much more complex
• These are complex and highly guarded algorithms unique to each search engine.
Slide adapted from Lew & Davis
Some ranking criteria
• For a given candidate result page, use:
– Number of matching query words in the page
– Proximity of matching words to one another
– Location of terms within the page
– Location of terms within tags, e.g. <title>, <h1>, link text, body text
– Anchor text on pages pointing to this one
– Frequency of terms on the page and in general
– Link analysis of which pages point to this one
– (Sometimes) Click-through analysis: how often the page is clicked on
– How “fresh” the page is
• Complex formulae combine these together (see the sketch below).
Slide adapted from Lew & Davis
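To make “combine these together” concrete, here is a toy weighted-sum scorer (the feature names and weights are invented for illustration; production ranking functions are proprietary and far more complex):

    # Toy ranking function: weighted sum of a few of the criteria above.
    WEIGHTS = {                     # invented weights, for illustration only
        "matching_words": 2.0,      # number of query words on the page
        "in_title": 3.0,            # query terms appearing in <title>
        "anchor_text_hits": 1.5,    # query terms in incoming anchor text
        "link_score": 4.0,          # link-analysis score (e.g., PageRank-like)
        "freshness": 0.5,           # how recently the page changed
    }

    def score(page_features):
        # page_features: dict mapping feature name -> numeric value
        return sum(w * page_features.get(f, 0.0) for f, w in WEIGHTS.items())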
Measuring Importance of Linking
• PageRank Algorithm
– Idea: important pages are pointed to by other important pages
– Method:
• Each link from one page to another is counted as a “vote” for the destination page
• But the importance of the starting page also influences the importance of the destination page.
• And those pages’ scores, in turn, depend on those linking to them.
Image and explanation from http://www.economist.com/science/tq/displayStory.cfm?story_id=3172188
Measuring Importance of Linking
• Example: each page starts with 100 points.
• Each page’s score is recalculated by adding up the score from each incoming link.
– This is the score of the linking page divided by the number of outgoing links it has.
– E.g., the page in green has 2 outgoing links and so its “points” are shared evenly by the 2 pages it links to.
• Keep repeating the score updates until no more changes.
Image and explanation from http://www.economist.com/science/tq/displayStory.cfm?story_id=3172188
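The update rule above is easy to express directly. A minimal sketch in Python (it omits the damping factor that the full PageRank algorithm adds, and assumes every page appears as a key and has at least one outgoing link):

    # Iterative link-score computation as described above (sketch).
    def page_scores(links, iterations=50):
        # links: dict mapping each page to the list of pages it links to.
        scores = {page: 100.0 for page in links}        # every page starts at 100
        for _ in range(iterations):
            new_scores = {page: 0.0 for page in links}
            for page, outgoing in links.items():
                share = scores[page] / len(outgoing)    # points split evenly
                for target in outgoing:
                    new_scores[target] += share         # a “vote” for the target
            scores = new_scores
        return scores

    # Example: A and B link to each other; C links to both.
    print(page_scores({"A": ["B"], "B": ["A"], "C": ["A", "B"]}))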
Bad Actors on the Web (Spam)
• Cloaking
– Serve fake content to search engine robot
– DNS cloaking: Switch IP address. Impersonate
– [Diagram: “Is this a search engine spider?” If yes, serve SPAM; if no, serve the real doc.]
• Doorway pages
– Pages optimized for a single keyword that redirect to the real target page
• Keyword Spam
– Misleading meta-keywords, excessive repetition of a term, fake “anchor text”
– Hidden text with colors, CSS tricks, etc.
– E.g., Meta-Keywords = “… London hotels, hotel, holiday inn, hilton, discount, booking, reservation, sex, mp3, britney spears, viagra, …”
• Link spamming
– Mutual admiration societies, hidden links, awards
– Domain flooding: numerous domains that point or re-direct to a target page
• Robots
– Fake click stream
– Fake query stream
– Millions of submissions via Add-Url
Slide adapted from Manning, Raghavan, & Schuetze
Inter-Related Topics
• Networked Systems
• Distributed Systems
– Web search engines
– Tools like Hadoop
• Web security
• Cryptography (next time)
Motivation for Hadoop
• How do you scale up applications?
– Run jobs processing 100s of terabytes of data
– Takes 11 days to read on 1 computer (at roughly 100 MB/s, 100 TB is about a million seconds, i.e., 11–12 days)
• Need lots of cheap computers
– Fixes speed problem (15 minutes on 1000 computers), but…
– Reliability problems
• In large clusters, computers fail every day
• Cluster size is not fixed
• Need common infrastructure
– Must be efficient and reliable
Slide adapted from Xiaoxiao Shi, Guan Wang
Query Serving Architecture
[Repeat of the earlier query-serving architecture slide: load balancer → front ends → query integrators → replicated index-serving nodes.]
Hadoop
• Open Source Apache Project
• Hadoop Core includes:
– Distributed File System - distributes data
– Map/Reduce - distributes application
• Written in Java
• Runs on
– Linux, Mac OS/X, Windows, and Solaris
– Commodity hardware
Slide adapted from Xiaoxiao Shi, Guan Wang
Hadoop Users
• Who uses Hadoop?
• Amazon/A9
• AOL
• Facebook
• Fox interactive media
• Google
• IBM
• New York Times
• PowerSet (now Microsoft)
• Quantcast
• Rackspace/Mailtrust
• Veoh
• Yahoo!
• More at http://wiki.apache.org/hadoop/PoweredBy
Slide adapted from Xiaoxiao Shi, Guan Wang
Typical Hadoop Structure
• Commodity hardware
– Linux PCs with 4 local disks
• Typically in a 2-level architecture
– 40 nodes/rack
– Uplink from rack is 8 gigabit
– Rack-internal is 1 gigabit all-to-all
Slide adapted from Xiaoxiao Shi, Guan Wang
[Two diagram slides (images only), adapted from Xiaoxiao Shi, Guan Wang]
Hadoop structure
• Single namespace for entire cluster
– Managed by a single namenode.
– Files are single-writer and append-only.
– Optimized for streaming reads of large files.
• Files are broken into large blocks.
– Typically 128 MB
– Replicated to several datanodes, for reliability
• Client talks to both namenode and datanodes
– Data is not sent through the namenode.
– Throughput of file system scales nearly linearly with the number of nodes.
• Access from Java, C, or command line (see the example below).
Slide adapted from Xiaoxiao Shi, Guan Wang
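For instance, from the command line the distributed file system looks much like a local one (a sketch; the file and directory paths are illustrative):

    # Copy a local file into HDFS, list the directory, and stream the file back.
    hadoop fs -put access.log /user/alice/access.log
    hadoop fs -ls /user/alice
    hadoop fs -cat /user/alice/access.log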
Map Reduce Architecture
Slide adapted from Xiaoxiao Shi, Guan Wang
Example of Hadoop Programming
• Intuition: design around <key, value> pairs
• Assume each node will process a paragraph…
• Map:
– What is the key?
– What is the value?
• Reduce:
– What to collect?
– What to reduce?
Slide adapted from Xiaoxiao Shi, Guan Wang
mapper.py
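The original slide showed the mapper as an image. Here is a minimal word-count mapper in the spirit of the tutorial linked on the “Check the Results” slide (a sketch, not necessarily the slide’s exact code): each input line is split into words, and each word is emitted as a <word, 1> pair on stdout.

    #!/usr/bin/env python
    # mapper.py - emit a <word, 1> pair for every word read from stdin (sketch).
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%s" % (word, 1))    # key and value separated by a tab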
reducer.py
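And a matching reducer sketch (again, not necessarily the slide’s exact code): Hadoop streaming sorts the mapper output by key before the reduce phase, so the reducer only needs to sum consecutive counts for the same word.

    #!/usr/bin/env python
    # reducer.py - sum counts per word; input arrives sorted by key (sketch).
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.strip().split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))  # flush previous word
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))          # flush last word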
Check the Results
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
Twitter Course This Fall!
• Big Data Analysis with Twitter
• Topics will include:
– Large Scale Anomaly Detection (at Twitter)
– Intro to Pig and Scalding
– Recommendation Algorithms
– Real-time Search
– Information Diffusion and Outbreak Detection on Twitter
– Trend detection in social streams
– Graph Algorithms for the Social Graph
Next Two Lectures + Additional Review
• Apr 24: Cryptography (Prof. John Chuang)
• Apr 26: Review (Monica and Alex)
• Tues May 1: Optional Review (Marti)