SMU CSE 5337/7337 Spring 2015

advertisement
Hmwk21
1. Okay: 6, 10, 12, 15
2. 40 tokens, 31 terms
The the are two tokens, but one term
3.
4.
5.
6.
a. okay, b. organ <> organization, one <> on
Okay
n/a
Just stating “1 token” is not enough, show some
detail. What did you try; what was the result?
02/17/15.1
Hmwk22
• Exercise 20.1: Why is it better to partition hosts
(rather than individual URLs) between the nodes
of a distributed crawl system?
• This allows for “politeness” measures to be
implemented, preventing excessive calls to a
particular machine in short periods of time, by
caching data from one node.
• Consider a web hosting company with multiple
websites, i.e. a pool of ip addresses. Grouping
hosts by ip address could also support
“politeness”.
02/17/15.2
Hmwk23
• Exercise 20.2: Why should the host splitter
precede the Duplicate URL Eliminator (DUE)?
• Host splitters can be cached, allowing the host
splitter to discard duplicate URLs, thereby
reducing network traffic. The same goes for DUE
but the DUE requires significant RAM storage on
each machine. Using the host splitter first, slows
the growth rate of the cache and reduces redundant
RAM usage.
02/17/15.3
Download