Uniform Sampling from the Web via
Random Walks
Ziv Bar-Yossef
Alexander Berg
Steve Chien
Jittat Fakcharoenphol
Dror Weitz
University of California at Berkeley
Motivation: Web Measurements
• Main goal: develop a cheap method to sample uniformly from the Web
• Use a random sample of web pages to approximate:
– search engine coverage
– domain name distribution (.com, .org, .edu)
– percentage of porn pages
– average number of links in a page
– average page length
• Note: a web page here means a static HTML page
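Given a uniform sample, each of these measurements reduces to a simple average over the sampled pages. A minimal sketch (the page records and URLs below are hypothetical stand-ins, not output of the walk):

```python
def estimate_fraction(sample, predicate):
    """Estimate the fraction of Web pages satisfying `predicate`
    from a uniform random sample of pages."""
    return sum(1 for page in sample if predicate(page)) / len(sample)

# Hypothetical page records; a real sample would come from the walk.
sample = [
    {"url": "http://a.com", "links": 12},
    {"url": "http://b.org", "links": 4},
    {"url": "http://c.com", "links": 20},
    {"url": "http://d.edu", "links": 8},
]

# Fraction of .com pages and average out-degree, estimated from the sample.
com_share = estimate_fraction(sample, lambda p: p["url"].endswith(".com"))
avg_links = sum(p["links"] for p in sample) / len(sample)
```

Because the sample is uniform, these plug-in estimates are unbiased; their accuracy grows with sample size in the usual way.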
The Structure of the Web
(Broder et al., 2000)
[Figure: the "bow-tie" structure of the Web — a large strongly connected component (≈1/4 of pages), a left side (≈1/4), a right side (≈1/4), and tendrils & isolated regions (≈1/4); the indexable web is marked as covering the strongly connected component and the right side]
Why is Web Sampling Hard?
• Obvious solution: sample from an index of all pages
• Maintaining an index of Web pages is difficult
– Requires extensive resources (storage, bandwidth)
– Hard to implement
• There is no consistent index of all Web pages
– Difficult to get complete coverage
– Takes a month to crawl/index most of the Web
– The Web changes every minute
Our Approach:
Random Walks for Random Sampling
• A random walk on a graph provides a sample of its nodes
• If the graph is undirected and regular, the sample is uniform
– Problem: the Web is neither undirected nor regular
• Our solution:
– Incrementally create an undirected regular graph with the same nodes as the Web
– Perform the walk on this graph
Related Work
• Monika Henzinger et al. (2000)
– Random walk produces pages distributed by Google's PageRank
– Weight these pages to produce a nearly uniform sample
• Krishna Bharat & Andrei Broder (1998)
– Measured relative size and overlap of search engines using random queries
• Steve Lawrence & Lee Giles (1998, 1999)
– Estimated the size of the Web by probing IP addresses and crawling servers
– Measured search engine coverage in response to certain queries
Random Walks: Definitions
• Step: from node v, pick an outgoing edge v → u uniformly at random and go to u
• Markov process: the probability of a transition depends only on the current state
• Probability distribution q_t: q_t(v) = probability that v is visited at step t
• Transition matrix A: q_{t+1} = q_t · A
• Stationary distribution: the limit of q_t as t grows, if it exists and is independent of q_0
• Mixing time: the number of steps required to approach the stationary distribution
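These definitions can be sketched with plain power iteration; the 3-node transition matrix below is an illustrative stand-in, not a Web graph:

```python
def step(q, A):
    """One transition: q_{t+1} = q_t * A, with q a row vector."""
    n = len(q)
    return [sum(q[i] * A[i][j] for i in range(n)) for j in range(n)]

def stationary(A, steps=1000):
    """Approximate the stationary distribution by iterating from a uniform q_0."""
    n = len(A)
    q = [1.0 / n] * n
    for _ in range(steps):
        q = step(q, A)
    return q

# Row-stochastic transition matrix of a 3-cycle with self loops.
# It is doubly stochastic, so the stationary distribution is uniform.
A = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5]]
pi = stationary(A)
```

Applying `step` to `pi` returns `pi` again (up to floating-point error), which is exactly the stationarity condition.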
Straightforward Random Walk on the Web
Follow a random out-link at each step.
[Figure: a walk following random out-links across pages such as netscape.com, amazon.com, and www.cs.berkeley.edu/~zivi, with steps numbered 1–9]
• Gets stuck in sinks and in dense Web communities
• Biased towards popular pages
• Converges slowly, if at all
WebWalker:
Undirected Regular Random Walk on the Web
• Follow a random out-link or a random in-link at each step
• Use weighted self loops to even out pages' degrees: w(v) = deg_max − deg(v)
[Figure: the example pages from the previous slide, now with weighted self loops so that every node's degree equals deg_max]
Fact: a random walk on a connected undirected regular graph converges to the uniform stationary distribution.
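The self-loop trick can be sketched on a toy undirected graph: staying put with probability w(v)/deg_max makes every node behave as if it had degree deg_max, so empirical visit counts approach uniform (a minimal simulation, not the WebWalker implementation):

```python
import random

def regular_walk_step(v, neighbors, deg_max, rng):
    """One step of a walk made regular by weighted self loops,
    w(v) = deg_max - deg(v): stay at v with probability w(v)/deg_max,
    otherwise move to a uniformly random neighbor."""
    deg = len(neighbors[v])
    if rng.random() < (deg_max - deg) / deg_max:
        return v  # self loop
    return rng.choice(neighbors[v])

# Small connected undirected graph (adjacency lists) with unequal degrees.
neighbors = {0: [1, 2, 3], 1: [0], 2: [0, 3], 3: [0, 2]}
deg_max = max(len(ns) for ns in neighbors.values())

rng = random.Random(0)
counts = {v: 0 for v in neighbors}
v = 0
for _ in range(200000):
    v = regular_walk_step(v, neighbors, deg_max, rng)
    counts[v] += 1
```

Despite node 1 having only one edge, its visit count comes out close to the others' — the self loops compensate for its low degree.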
WebWalker: Mixing Time
Theorem [Markov chain folklore]: a random walk's mixing time is at most log(N) / (1 − λ₂), where
• N = size of the graph
• 1 − λ₂ = eigenvalue gap of the transition matrix
Experiment (using an extensive Alexa crawl of the Web from 1996):
• WebWalker's eigenvalue gap: 1 − λ₂ ≈ 10⁻⁵
• Result: WebWalker's mixing time is 3.1 million steps
• Self-loop steps are free
• Only 1 in 30,000 steps is not a self-loop step (deg_max ≈ 3×10⁵, deg_avg = 10)
• Result: WebWalker's actual mixing time is only 100 steps!
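The arithmetic behind these figures is a back-of-the-envelope calculation. A sketch using the numbers quoted above (the exact constant depends on the log base and the target accuracy in the bound, so this lands near, not exactly at, the 3.1-million figure):

```python
import math

# Figures quoted on the slide (1996 Alexa crawl).
N = 37.5e6      # number of pages in the graph
gap = 1e-5      # eigenvalue gap 1 - lambda_2
deg_max = 3e5
deg_avg = 10

# Folklore bound: mixing time <= log(N) / (1 - lambda_2),
# on the order of a few million steps.
mixing_bound = math.log(N) / gap

# Only about 1 in deg_max/deg_avg = 30,000 steps follows a real link;
# the rest are free self loops, so the number of real steps is tiny.
real_steps = mixing_bound / (deg_max / deg_avg)
```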
WebWalker: Mixing Time (cont.)
• Mixing time on the current Web may be similar
– Some evidence that the structure of the Web today is
similar to the structure in 1996
(Kumar et al., 1999, Broder et al., 2000)
WebWalker: Realization (1)
WebWalker(v):
• Spend an expected deg_max / deg(v) steps at v (weighted self loops)
• Pick a random link incident to v (either v → u or u → v)
• Recurse: WebWalker(u)
Problems
• The in-links of v are not available
• deg(v) is not available
Partial sources of in-links:
• Previously visited nodes
• Reverse link services of search engines
WebWalker: Realization (2)
• WebWalker uses only available links:
– out-links
– in-links from previously visited pages
– first r in-links returned from the search engines
• WebWalker walks on a sub-graph of the Web
– sub-graph induced by available links
– to ensure consistency: as soon as a page is visited its
incident edge list is fixed for the rest of the walk
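The realization above can be sketched in a few lines. The toy graph, the `out_links`/`reverse_links` callables, and the page names are all hypothetical; a real implementation would fetch out-links from pages and in-links from search-engine reverse-link queries:

```python
import random

def webwalker(start, out_links, reverse_links, deg_max, steps, rng):
    """Sketch of WebWalker on the sub-graph induced by available links.
    A page's incident edge list is frozen the first time it is visited."""
    edges = {}  # page -> frozen list of incident pages

    def incident(v):
        if v not in edges:
            # Available links: out-links, in-links reported by reverse-link
            # queries, and links back from previously visited pages whose
            # frozen lists mention v.
            known = {u for u, nbrs in edges.items() if v in nbrs}
            edges[v] = sorted(set(out_links(v)) | set(reverse_links(v)) | known)
        return edges[v]

    v, path = start, []
    for _ in range(steps):
        nbrs = incident(v)
        # Weighted self loop: stay with probability (deg_max - deg(v)) / deg_max.
        if rng.random() >= (deg_max - len(nbrs)) / deg_max:
            v = rng.choice(nbrs)
        path.append(v)
    return path

# Toy directed graph standing in for the Web (hypothetical pages).
outs = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rev = {"c": ["a"]}  # the reverse links a search engine happens to know
rng = random.Random(1)
path = webwalker("a", lambda v: outs[v], lambda v: rev.get(v, []), 4, 50, rng)
```

Freezing each edge list on first visit is what keeps the walk consistent: the induced sub-graph never loses an edge the walk has already relied on.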
WebWalker: Example
[Figure: a small example — the Web graph on nodes v1–v6 next to WebWalker's induced sub-graph, with self-loop weights w on each node; the legend distinguishes pages covered vs. not covered by search engines, and available vs. non-available links]
WebWalker: Bad News
• WebWalker becomes a true random walk only
after its induced sub-graph “stabilizes”
• Induced sub-graph is random
• Induced sub-graph misses some of the nodes
• The eigenvalue gap analysis no longer applies
WebWalker: Good News
• WebWalker eventually converges to a uniform
distribution on the nodes of its induced sub-graph
• WebWalker is a "close approximation" of a true random walk well before the sub-graph stabilizes
• Theorem: WebWalker’s induced sub-graph is
guaranteed to eventually cover the whole
indexable Web.
• Corollary: WebWalker can produce uniform
samples from the indexable Web.
Evaluation of WebWalker’s Performance
Questions to address in experiments:
• Structure of induced sub-graphs
• Mixing time
• Potential bias in early stages of the walk:
– towards high degree pages
– towards the search engines
– towards the starting page’s neighborhood
WebWalker: Evaluation Experiments
• Run WebWalker on the 1996 copy of the Web
– 37.5 million pages
– 15 million indexable pages
– degavg= 7.15
– degmax= 300,000
• Designate a fraction p of the pages as the search engine
index
• Use WebWalker to generate a sample of 100,000 pages
• Check the resulting sample against the actual values
Evaluation: Bias towards High Degree Nodes
[Chart: percent of sampled nodes in each degree decile, ordered from high-degree to low-degree]
Evaluation: Bias towards the Search Engines
[Chart: WebWalker's estimate of search engine size when the designated index holds 30% and 50% of the pages]
Evaluation: Bias towards the Starting Node’s
Neighborhood
[Chart: percent of sampled nodes in each decile of distance from the starting node, ordered from close to far]
WebWalker: Experiments on the Web
• Run WebWalker on the actual Web
• Two runs of 34,000 pages each
• Dates: July 8, 2000 - July 15, 2000
• Used four search engines for reverse links: AltaVista, HotBot, Lycos, Go
Domain Name Distribution
[Chart: domain name distribution over com, org, edu, net, de, jp, uk, us, gov, ch, ca, fr, tw, au, mil, comparing WebWalker (70,000 pages), the Henzinger et al. walk (2 million), the Henzinger et al. crawl (80 million), and an Inktomi crawl (1 billion)]
Search Engine Coverage
[Chart: estimated search engine coverage — Google 68%, AltaVista 54%, Fast 50%, Lycos 50%, HotBot 48%, Go 38%]
Web Page Parameters
• Average page size: 8,390 bytes
• Average number of images on a page: 9.3
• Average number of hyperlinks on a page: 15.6
Conclusions
• Uniform sampling of Web pages by random walks
• Good news:
– walk provably converges to a uniform distribution
– easy to implement and run with few resources
– encouraging experimental results
• Bad news:
– no theoretical guarantees on the walk’s mixing time
– some biases towards high degree nodes and the search engines
• Future work:
– obtain a better theoretical analysis
– eliminate biases
– deal with dynamic content
Thank You!