Lyle School of Engineering

Information Retrieval
CSE 8337 (Part B)
Spring 2009
Some material for these slides was obtained from:
Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto
http://www.sims.berkeley.edu/~hearst/irbook/
Data Mining Introductory and Advanced Topics by Margaret H. Dunham
http://www.engr.smu.edu/~mhd/book
Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and
Hinrich Schütze
http://informationretrieval.org
CSE 8337 Outline
• Introduction
• Simple Text Processing
• Boolean Queries
• Web Searching/Crawling
• Indexes
• Vector Space Model
• Matching
• Evaluation
Web Searching TOC
• Web Overview
• Searching
• Ranking
• Crawling
Web Overview
• Size
  – >11.5 billion pages (2005)
  – Grows at more than 1 million pages a day
  – Google indexes over 3 billion documents
• Diverse types of data
  – http://www.google.com/support/websearch/bin/topic.py?topic=8996
Web Data
• Web pages
• Intra-page structures
• Inter-page structures
• Usage data
• Supplemental data
  – Profiles
  – Registration information
  – Cookies
Zipf’s Law Applied to Web
• Distribution of frequency of occurrence of words in text.
• "Frequency of the i-th most frequent word is 1/i^q times that of the most frequent word"
• http://www.nslij-genetics.org/wli/zipf/
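A minimal Python sketch of the relationship quoted above. The base frequency and the exponent q below are made-up illustrative values, not measurements from any real corpus.

# Zipf's law sketch: the i-th most frequent word occurs roughly
# 1/i^q as often as the most frequent word.
def zipf_frequency(i, f1=1_000_000, q=1.0):
    """Expected frequency of the i-th most frequent word,
    given that the most frequent word occurs f1 times."""
    return f1 / (i ** q)

for rank in (1, 2, 10, 100, 1000):
    print(rank, round(zipf_frequency(rank)))
# rank 1 -> 1,000,000; rank 2 -> 500,000; rank 10 -> 100,000; ...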
Heaps' Law Applied to Web
• Measures size of vocabulary in a text of size n: O(n^b)
• b is normally less than 1
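A similarly hedged Python sketch of Heaps' law; the constants K and b below are assumptions chosen only to illustrate the sub-linear growth of vocabulary size.

import math  # not strictly needed; shown for clarity that n**b is a power law

# Heaps' law sketch: vocabulary size grows as K * n^b with b < 1.
def heaps_vocabulary(n, K=50, b=0.5):
    """Estimated number of distinct words in a text of n total words."""
    return K * n ** b

for n in (10_000, 1_000_000, 100_000_000):
    print(f"{n:>12,} tokens -> ~{heaps_vocabulary(n):,.0f} distinct words")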
Web search basics
[Diagram: a user issues the query "miele"; a web spider crawls the Web, an indexer builds the indexes (including separate ad indexes), and the engine returns ranked web results ("Results 1 - 10 of about 7,310,000 for miele (0.12 seconds)", e.g. miele.com, miele.co.uk, miele.de, miele.at) alongside sponsored links from appliance and vacuum retailers.]
How far do people look for results?
(Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)
Users’ empirical evaluation of results
• Quality of pages varies widely
  – Relevance is not enough
  – Other desirable qualities (non-IR!!)
    – Content: trustworthy, diverse, non-duplicated, well maintained
    – Web readability: display correctly & fast
    – No annoyances: pop-ups, etc.
• Precision vs. recall
  – On the web, recall seldom matters
  – What matters
    – Precision at 1? Precision above the fold?
    – Comprehensiveness – must be able to deal with obscure queries
      – Recall matters when the number of matches is very small
• User perceptions may be unscientific, but are significant over a large aggregate
Users’ empirical evaluation of engines
• Relevance and validity of results
• UI – simple, no clutter, error tolerant
• Trust – results are objective
• Coverage of topics for polysemic queries
• Pre/post-process tools provided
  – Mitigate user errors (auto spell check, search assist, …)
  – Explicit: search within results, more like this, refine ...
  – Anticipative: related searches
• Deal with idiosyncrasies
  – Web-specific vocabulary
    – Impact on stemming, spell-check, etc.
  – Web addresses typed in the search box
  – …
Simplest forms
• First generation engines relied heavily on tf/idf
  – The top-ranked pages for the query maui resort were the ones containing the most maui's and resort's
• SEOs (search engine optimizers) responded with dense repetitions of chosen terms
  – e.g., maui resort maui resort maui resort
  – Often, the repetitions would be in the same color as the background of the web page
    – Repeated terms got indexed by crawlers
    – But not visible to humans on browsers
• Pure word density cannot be trusted as an IR signal
Term frequency tf
• The term frequency tf_t,d of term t in document d is defined as the number of times that t occurs in d.
• Raw term frequency is not what we want:
  – A document with 10 occurrences of the term is more relevant than a document with one occurrence of the term.
  – But not 10 times more relevant.
• Relevance does not increase proportionally with term frequency.
Log-frequency weighting
• The log-frequency weight of term t in d is

    w_t,d = 1 + log10(tf_t,d)   if tf_t,d > 0
          = 0                   otherwise

  – 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
• Score for a document-query pair: sum over terms t in both q and d:

    score = Σ_{t ∈ q ∩ d} (1 + log10 tf_t,d)

  – The score is 0 if none of the query terms is present in the document.
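A small Python sketch of the log-frequency weight and the overlap score defined above; the toy document counts are invented for illustration.

import math

def log_tf_weight(tf):
    """Log-frequency weight: 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def overlap_score(query_terms, doc_term_counts):
    """Sum of log-tf weights over terms appearing in both query and document."""
    return sum(log_tf_weight(doc_term_counts.get(t, 0)) for t in query_terms)

doc = {"miele": 3, "vacuum": 1, "appliance": 2}
print(overlap_score(["miele", "resort"], doc))  # only "miele" contributes: 1 + log10(3) ≈ 1.48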
Document frequency
• Rare terms are more informative than frequent terms
  – Recall stop words
• Consider a term in the query that is rare in the collection (e.g., arachnocentric)
• A document containing this term is very likely to be relevant to the query arachnocentric
• → We want a high weight for rare terms like arachnocentric.
Document frequency, continued
• Consider a query term that is frequent in the collection (e.g., high, increase, line)
• For frequent terms, we want positive weights for words like high, increase, and line, but lower weights than for rare terms.
• We will use document frequency (df) to capture this in the score.
• df (≤ N) is the number of documents that contain the term
idf weight
• df_t is the document frequency of t: the number of documents that contain t
  – df_t is an inverse measure of the informativeness of t
• We define the idf (inverse document frequency) of t by

    idf_t = log10(N/df_t)

  – We use log(N/df_t) instead of N/df_t to "dampen" the effect of idf.
  – It will turn out that the base of the log is immaterial.
idf example, suppose N = 1 million

term        df_t        idf_t
calpurnia   1           6
animal      100         4
sunday      1,000       3
fly         10,000      2
under       100,000     1
the         1,000,000   0

There is one idf value for each term t in a collection.
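The idf values in the table can be reproduced directly from the definition idf_t = log10(N/df_t); a short Python check:

import math

N = 1_000_000  # collection size from the slide

df = {"calpurnia": 1, "animal": 100, "sunday": 1_000,
      "fly": 10_000, "under": 100_000, "the": 1_000_000}

for term, dft in df.items():
    idf = math.log10(N / dft)
    print(f"{term:10s} df={dft:>9,}  idf={idf:.0f}")
# calpurnia -> 6, animal -> 4, sunday -> 3, fly -> 2, under -> 1, the -> 0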
Collection vs. Document frequency
• The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.
• Example:

  Word        Collection frequency   Document frequency
  insurance   10440                  3997
  try         10422                  8760

• Which word is a better search term (and should get a higher weight)?
tf-idf weighting
• The tf-idf weight of a term is the product of its tf weight and its idf weight.

    w_t,d = (1 + log10 tf_t,d) × log10(N/df_t)

• Best known weighting scheme in information retrieval
  – Note: the "-" in tf-idf is a hyphen, not a minus sign!
  – Alternative names: tf.idf, tf x idf, tfidf, tf/idf
• Increases with the number of occurrences within a document
• Increases with the rarity of the term in the collection
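A hedged Python sketch of the tf-idf weight as defined on this slide (log-frequency tf times log idf); the example numbers are arbitrary.

import math

def tf_idf(tf, df, N):
    """tf-idf weight: (1 + log10 tf) * log10(N / df); 0 if the term is absent."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

# Term occurring 3 times in a document, in 1,000 of 1,000,000 documents:
print(tf_idf(tf=3, df=1_000, N=1_000_000))  # (1 + log10 3) * 3 ≈ 4.43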
Search engine optimization (Spam)
• Motives
  – Commercial, political, religious, lobbies
  – Promotion funded by advertising budget
• Operators
  – Search engine optimizers for lobbies, companies
  – Web masters
  – Hosting services
• Forums
  – E.g., Web master world (www.webmasterworld.com)
    – Search engine specific tricks
    – Discussions about academic papers
Cloaking
• Serve fake content to search engine spider
• DNS cloaking: switch IP address. Impersonate
• How do you identify a spider?
[Diagram: cloaking – "Is this a search engine spider?" If yes (Y), serve the SPAM page; if no (N), serve the real document.]
More spam techniques
• Doorway pages
  – Pages optimized for a single keyword that re-direct to the real target page
• Link spamming
  – Mutual admiration societies, hidden links, awards – more on these later
  – Domain flooding: numerous domains that point or re-direct to a target page
• Robots
  – Fake query stream – rank checking programs
The war against spam
• Quality signals – prefer authoritative pages based on:
  – Votes from authors (linkage signals)
  – Votes from users (usage signals)
• Policing of URL submissions
  – Anti robot test
• Limits on meta-keywords
• Robust link analysis
  – Ignore statistically implausible linkage (or text)
  – Use link analysis to detect spammers (guilt by association)
• Spam recognition by machine learning
  – Training set based on known spam
• Family friendly filters
  – Linguistic analysis, general classification techniques, etc.
  – For images: flesh tone detectors, source text analysis, etc.
• Editorial intervention
  – Blacklists
  – Top queries audited
  – Complaints addressed
  – Suspect pattern detection
More on spam
• Web search engines have policies on SEO practices they tolerate/block
  – http://help.yahoo.com/help/us/ysearch/index.html
  – http://www.google.com/intl/en/webmasters/
• Adversarial IR: the unending (technical) battle between SEOs and web search engines
• Research: http://airweb.cse.lehigh.edu
Ranking
• Order documents based on relevance to the query (similarity measure)
• Ranking has to be performed without accessing the text, just the index
• All information about ranking algorithms is "top secret"; it is almost impossible to measure recall, as the number of relevant pages can be quite large for simple queries
Ranking
• Some of the new ranking algorithms also use hyperlink information
• Important difference between the Web and normal IR databases: the number of hyperlinks that point to a page provides a measure of its popularity and quality.
• Links in common between pages often indicate a relationship between those pages.
Ranking
• Three examples of ranking techniques based on link analysis:
  – WebQuery
  – HITS (Hub/Authority pages)
  – PageRank
WebQuery
• WebQuery takes a set of Web pages (for example, the answer to a query) and ranks them based on how connected each Web page is
• http://www.cgl.uwaterloo.ca/Projects/Vanish/webquery-1.html
HITS
• Kleinberg's ranking scheme depends on the query and considers the set of pages S that point to or are pointed to by pages in the answer
  – Pages in S that have many links pointing to them are called authorities
  – Pages that have many outgoing links are called hubs
• Better authority pages come from incoming edges from good hubs, and better hub pages come from outgoing edges to good authorities
Ranking
H(p) = Σ_{u ∈ S | p → u} A(u)

A(p) = Σ_{v ∈ S | v → p} H(v)
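A minimal Python sketch of iterating these hub/authority updates on a small made-up link graph; real HITS implementations normalize and test convergence more carefully.

# Sketch of the HITS update rules above, assuming the subgraph S is given
# as an adjacency list {page: [pages it links to]}.
def hits(links, iterations=50):
    pages = set(links) | {v for targets in links.values() for v in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A(p) = sum of H(v) over pages v that link to p
        auth = {p: sum(hub[v] for v in pages if p in links.get(v, [])) for p in pages}
        # H(p) = sum of A(u) over pages u that p links to
        hub = {p: sum(auth[u] for u in links.get(p, [])) for p in pages}
        # normalize so the scores do not grow without bound
        for scores in (auth, hub):
            norm = sum(s * s for s in scores.values()) ** 0.5 or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

hub, auth = hits({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
print(auth)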
PageRank
• Used in Google
• PageRank simulates a user navigating randomly in the Web who jumps to a random page with probability q or follows a random hyperlink (on the current page) with probability 1 - q
• This process can be modeled with a Markov chain, from which the stationary probability of being in each page can be computed
• Let C(a) be the number of outgoing links of page a and suppose that page a is pointed to by pages p1 to pn
PageRank (cont’d)
• PR(p) = c (PR(1)/N1 + … + PR(n)/Nn)
  – PR(i): PageRank of page i, which points to target page p.
  – Ni: number of links coming out of page i
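A minimal power-iteration sketch in Python, consistent with the random-surfer description above; the jump probability q = 0.15 and the tiny graph are assumptions for illustration, and dangling pages are handled by spreading their rank evenly.

def pagerank(links, q=0.15, iterations=50):
    pages = set(links) | {v for targets in links.values() for v in targets}
    N = len(pages)
    pr = {p: 1.0 / N for p in pages}
    for _ in range(iterations):
        new = {p: q / N for p in pages}                  # random jump with probability q
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = (1 - q) * pr[p] / len(targets)   # PR(p) / C(p) to each target
                for t in targets:
                    new[t] += share
            else:
                for t in pages:                          # dangling page: spread rank evenly
                    new[t] += (1 - q) * pr[p] / N
        pr = new
    return pr

print(pagerank({"a": ["b"], "b": ["c"], "c": ["a", "b"]}))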
Conclusion
• Nowadays search engines use, basically, Boolean or vector models and their variations
• Link analysis techniques seem to be the "next generation" of search engines
• Indexes: compression and distributed architecture are key
Crawlers
• Robot (spider) traverses the hypertext structure of the Web.
• Collects information from visited pages
• Used to construct indexes for search engines
• Traditional Crawler – visits entire Web (?) and replaces index
• Periodic Crawler – visits portions of the Web and updates subset of index
• Incremental Crawler – selectively searches the Web and incrementally modifies index
• Focused Crawler – visits pages related to a particular subject
Crawling the Web
• The order in which the URLs are traversed is important
  – Using a breadth-first policy, we first look at all the pages linked by the current page, and so on. This matches well Web sites that are structured by related topics. On the other hand, the coverage will be wide but shallow, and a Web server can be bombarded with many rapid requests
  – In the depth-first case, we follow the first link of a page and we do the same on that page until we cannot go deeper, returning recursively
  – Good ordering schemes can make a difference if crawling better pages first (PageRank)
Crawling the Web
• Because robots can overwhelm a server with rapid requests and can use significant Internet bandwidth, a set of guidelines for robot behavior has been developed
• Crawlers can also have problems with HTML pages that use frames or image maps. In addition, dynamically generated pages cannot be indexed, nor can password-protected pages
Focused Crawler
• Only visit links from a page if that page is determined to be relevant.
• Components:
  – Classifier: assigns a relevance score to each page based on the crawl topic.
  – Distiller: identifies hub pages.
  – Crawler: visits pages based on classifier and distiller scores.
• Classifier also determines how useful outgoing links are
• Hub pages contain links to many relevant pages. Must be visited even if not high relevance score.
Focused Crawler
[Diagram: focused crawler architecture]
Basic crawler operation
• Begin with known "seed" pages
• Fetch and parse them
  – Extract URLs they point to
  – Place the extracted URLs on a queue
• Fetch each URL on the queue and repeat
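A toy Python sketch of this fetch-parse-enqueue loop, using only the standard library and a naive regex for links; a real crawler would add politeness, robots.txt handling, and robust HTML parsing.

import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

def crawl(seed_urls, max_pages=10):
    frontier = deque(seed_urls)        # queue of URLs waiting to be fetched
    seen = set(seed_urls)              # URLs already queued (duplicate URL elimination)
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                   # skip pages that fail to fetch
        fetched += 1
        for href in re.findall(r'href="([^"]+)"', html):
            absolute = urljoin(url, href)      # expand relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return seen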
Crawling picture
[Diagram: seed pages start the crawl; URLs crawled and parsed feed the URL frontier, which gradually reaches into the unseen Web.]
Simple picture – complications
• Web crawling isn't feasible with one machine
  – All of the above steps distributed
• Even non-malicious pages pose challenges
  – Latency/bandwidth to remote servers vary
  – Webmasters' stipulations
    – How "deep" should you crawl a site's URL hierarchy?
  – Site mirrors and duplicate pages
• Malicious pages
  – Spam pages
  – Spider traps
• Politeness – don't hit a server too often
What any crawler must do
• Be Polite: respect implicit and explicit politeness considerations
  – Only crawl allowed pages
  – Respect robots.txt (more on this shortly)
• Be Robust: be immune to spider traps and other malicious behavior from web servers
What any crawler should do
• Be capable of distributed operation: designed to run on multiple distributed machines
• Be scalable: designed to increase the crawl rate by adding more machines
• Performance/efficiency: permit full use of available processing and network resources
What any crawler should do
• Fetch pages of "higher quality" first
• Continuous operation: continue fetching fresh copies of a previously fetched page
• Extensible: adapt to new data formats, protocols
Updated crawling picture
[Diagram: multiple crawling threads pull URLs from the URL frontier; URLs crawled and parsed extend the seed pages into the unseen Web.]
URL frontier
• Can include multiple pages from the same host
• Must avoid trying to fetch them all at the same time
• Must try to keep all crawling threads busy
Explicit and implicit politeness
• Explicit politeness: specifications from webmasters on what portions of a site can be crawled
  – robots.txt
• Implicit politeness: even with no specification, avoid hitting any site too often
Robots.txt
• Protocol for giving spiders ("robots") limited access to a website, originally from 1994
  – www.robotstxt.org/wc/norobots.html
• Website announces its request on what can(not) be crawled
  – For a URL, create a file URL/robots.txt
  – This file specifies access restrictions
Robots.txt example
• No robot should visit any URL starting with "/yoursite/temp/", except the robot called "searchengine":

  User-agent: *
  Disallow: /yoursite/temp/

  User-agent: searchengine
  Disallow:
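Python's standard library can check rules like these; a small sketch using urllib.robotparser against the example above (the example.com URLs are placeholders):

from urllib import robotparser

rules = """\
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "http://example.com/yoursite/temp/page.html"))            # False
print(rp.can_fetch("searchengine", "http://example.com/yoursite/temp/page.html")) # True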
Processing steps in crawling
• Pick a URL from the frontier
  – Which one?
• Fetch the document at the URL
• Parse the URL
  – Extract links from it to other docs (URLs)
• Check if the URL has content already seen
  – If not, add to indexes
• For each extracted URL
  – Ensure it passes certain URL filter tests (e.g., only crawl .edu, obey robots.txt, etc.)
  – Check if it is already in the frontier (duplicate URL elimination)
Basic crawl architecture
[Diagram: basic crawl architecture – the URL frontier feeds a Fetch module (using DNS and the WWW), followed by Parse, a "Content seen?" test (doc fingerprints), a URL filter (robots filters), and duplicate URL elimination (URL set), with surviving URLs returned to the frontier.]
DNS (Domain Name Server)
• A lookup service on the internet
  – Given a URL, retrieve its IP address
  – Service provided by a distributed set of servers – thus, lookup latencies can be high (even seconds)
• Common OS implementations of DNS lookup are blocking: only one outstanding request at a time
• Solutions
  – DNS caching
  – Batch DNS resolver – collects requests and sends them out together
Parsing: URL normalization
• When a fetched document is parsed, some of the extracted links are relative URLs
  – E.g., at http://en.wikipedia.org/wiki/Main_Page we have a relative link to /wiki/Wikipedia:General_disclaimer, which is the same as the absolute URL http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer
• During parsing, must normalize (expand) such relative URLs
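In Python this expansion can be done with urllib.parse.urljoin; a one-line sketch using the Wikipedia example above:

from urllib.parse import urljoin

base = "http://en.wikipedia.org/wiki/Main_Page"
print(urljoin(base, "/wiki/Wikipedia:General_disclaimer"))
# -> http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer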
Content seen?
• Duplication is widespread on the web
• If the page just fetched is already in the index, do not further process it
• This is verified using document fingerprints or shingles
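A hedged Python sketch of shingle-based near-duplicate detection: hash every k-word window of a page and compare the resulting sets (the two toy strings below are invented):

import hashlib

def shingles(text, k=4):
    """Set of hashed k-word shingles of a text."""
    words = text.lower().split()
    return {
        hashlib.md5(" ".join(words[i:i + k]).encode()).hexdigest()
        for i in range(max(len(words) - k + 1, 1))
    }

def jaccard(a, b):
    """Jaccard overlap of two shingle sets; near-duplicates score close to 1."""
    return len(a & b) / len(a | b) if a | b else 1.0

d1 = "miele vacuum cleaners complete selection free shipping"
d2 = "miele vacuum cleaners complete selection fast shipping"
print(jaccard(shingles(d1), shingles(d2)))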
Filters and robots.txt
• Filters – regular expressions for URLs to be crawled/not
• Once a robots.txt file is fetched from a site, need not fetch it repeatedly
  – Doing so burns bandwidth, hits the web server
  – Cache robots.txt files
Duplicate URL elimination
• For a non-continuous (one-shot) crawl, test to see if an extracted+filtered URL has already been passed to the frontier
• For a continuous crawl – see details of frontier implementation
Distributing the crawler
• Run multiple crawl threads, under different processes – potentially at different nodes
  – Geographically distributed nodes
• Partition hosts being crawled into nodes
  – Hash used for partition
• How do these nodes communicate?
URL frontier: two main considerations
• Politeness: do not hit a web server too frequently
• Freshness: crawl some pages more often than others
  – E.g., pages (such as news sites) whose content changes often
• These goals may conflict with each other.
  (E.g., a simple priority queue fails – many links out of a page go to its own site, creating a burst of accesses to that site.)
Politeness – challenges
• Even if we restrict only one thread to fetch from a host, it can hit that host repeatedly
• Common heuristic: insert a time gap between successive requests to a host that is >> the time for the most recent fetch from that host
URL frontier: Mercator scheme
[Diagram: URLs flow into a Prioritizer, which feeds K front queues; a biased front queue selector and back queue router move URLs into B back queues (a single host on each); a back queue selector hands URLs to the crawl thread requesting a URL.]
Mercator URL frontier
• URLs flow in from the top into the frontier
• Front queues manage prioritization
• Back queues enforce politeness
• Each queue is FIFO
• http://mercator.comm.nsdlib.org/
Front queues
[Diagram: the Prioritizer distributes URLs over front queues 1 … K; the biased front queue selector and back queue router pull from them.]
Front queues
• Prioritizer assigns to each URL an integer priority between 1 and K
  – Appends URL to corresponding queue
• Heuristics for assigning priority
  – Refresh rate sampled from previous crawls
  – Application-specific (e.g., "crawl news sites more often")
Biased front queue selector
• When a back queue requests a URL (in a sequence to be described): picks a front queue from which to pull a URL
• This choice can be round robin biased to queues of higher priority, or some more sophisticated variant
  – Can be randomized
Back queues
[Diagram: the biased front queue selector and back queue router feed back queues 1 … B, which are drained by the back queue selector.]
Back queue invariants
• Each back queue is kept non-empty while the crawl is in progress
• Each back queue only contains URLs from a single host
  – Maintain a table from hosts to back queues (host name → back queue number, 1 … B)
Back queue heap
• One entry for each back queue
• The entry is the earliest time t_e at which the host corresponding to the back queue can be hit again
• This earliest time is determined from
  – Last access to that host
  – Any time buffer heuristic we choose
Back queue processing
• A crawler thread seeking a URL to crawl:
  – Extracts the root of the heap
  – Fetches the URL at the head of the corresponding back queue q (looked up from the table)
  – Checks if queue q is now empty – if so, pulls a URL v from the front queues
    – If there's already a back queue for v's host, append v to that queue and pull another URL from the front queues; repeat
    – Else add v to q
  – When q is non-empty, create a heap entry for it
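A simplified Python sketch of this back-queue discipline: a heap keyed by the earliest time each host may be contacted again. The 10-second politeness gap is an assumed value, and refilling back queues from the front queues is omitted.

import heapq
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

POLITENESS_GAP = 10.0                      # assumed gap between hits to one host

back_queues = defaultdict(deque)           # host -> FIFO queue of its URLs
heap = []                                  # (earliest_allowed_time, host)

def add_url(url):
    host = urlparse(url).netloc
    if host not in back_queues:            # new host: may be fetched immediately
        heapq.heappush(heap, (time.time(), host))
    back_queues[host].append(url)

def next_url():
    """Wait until some host may politely be hit again, then return one of its URLs."""
    earliest, host = heapq.heappop(heap)   # root of the heap = soonest host
    wait = earliest - time.time()
    if wait > 0:
        time.sleep(wait)
    url = back_queues[host].popleft()
    if back_queues[host]:                  # host still has URLs: re-insert with new time
        heapq.heappush(heap, (time.time() + POLITENESS_GAP, host))
    else:
        del back_queues[host]
    return url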
Number of back queues B
• Keep all threads busy while respecting politeness
• Mercator recommendation: three times as many back queues as crawler threads