Models and Structure of The Web Graph
Stefano Leonardi
Università di Roma “La Sapienza”
thanks to A. Broder, E. Koutsoupias, P. Raghavan,
D. Donato, G. Caldarelli, L. Salete Buriol
Complex Networks
• Self-organized and operated by a multitude of distributed agents without a central plan
• Dynamically evolving in time
• Connections established on the basis of local knowledge
• Nevertheless efficient and robust, i.e.
  – every other vertex can be reached in a few hops
  – resilient to random faults
• Examples:
  – The physical Internet
  – The Autonomous System graph
  – The Web Graph
  – The graphs of phone calls
  – Food webs
  – E-mail
  – etc.
Outline of the Lecture
• Small World Phenomena
  – The Milgram Experiment
  – Models of the Small World
• The Web Graph
  – Experiments
  – Power laws
  – The Bow Tie Structure
  – Topological properties
• Algorithms
• Mathematical Models
  – Preferential Attachment
  – Copying model
  – Multi-layer
• Graph Models for the Web
• The Internet Graph
  – Geography matters
• Representation and Compression of the Web Graph
Small World Phenomena
Properties of Small World Networks
• Surprising properties of small worlds:
  – The graph of acquaintances has small diameter
  – There is a simple local algorithm that can route a message in a few steps
  – The number of edges is small, linear in the number of vertices
  – The network is resilient to edge faults
Small World Phenomena
• A small world is a network with
  – a large fraction of short chains of acquaintances
  – a small fraction of "shortcuts" linking clusters with one another: a superposition of structured clusters
  – short distances, with no high-degree nodes needed
• There are lots of small worlds:
  – spread of diseases
  – electric power grids
  – phone calls at a given time
  – etc.
The model of Watts and Strogatz [1998]
1. Take a ring of n vertices, with every vertex connected to the next k = 2 nodes
2. Randomly rewire every edge to a random destination with probability p
3. The resulting graph has, with high probability, diameter O(log n), where diameter = max_{u,v} distance(u,v)
A sketch of the construction follows below.
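A minimal Python sketch of the construction (not from the paper; the parameters and the BFS-based diameter estimate are illustrative):

import random
from collections import deque

def watts_strogatz(n, k=2, p=0.1, seed=0):
    """Ring of n vertices, each joined to its next k neighbours;
    every edge is then rewired to a random destination with probability p."""
    rng = random.Random(seed)
    adj = {u: set() for u in range(n)}
    for u in range(n):
        for j in range(1, k + 1):
            v = (u + j) % n
            if rng.random() < p:          # rewire this edge
                v = rng.randrange(n)
            if v != u:
                adj[u].add(v)
                adj[v].add(u)             # treat the result as undirected
    return adj

def eccentricity(adj, src):
    """Largest BFS distance from src over the reachable part of the graph."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())

adj = watts_strogatz(10_000, k=2, p=0.05)
print(max(eccentricity(adj, s) for s in random.sample(range(10_000), 20)))  # rough diameter estimate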
The model of Jon Kleinberg [2000]
• Consider a 2-dimensional grid
• For each node u, add an edge (u,v) to a vertex v selected with probability proportional to [d(u,v)]^-r
  – If r = 0, v is selected uniformly at random, as in the Watts-Strogatz model
  – If r = 2, the distance d(u,v) falls between x and 2x with roughly equal probability at every scale x
Routing in the Small World
• Define a local routing algorithm that knows:
  – its own position in the grid
  – the position in the grid of the destination
  – its set of neighbours, short range and long range
  – the neighbours of all the vertices that have seen the message
• If r = 2, the expected delivery time is O(log^2 n)
• If r ≠ 2, the expected delivery time is Ω(n^ε), where ε depends on r
A sketch of greedy routing on the grid follows below.
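A small sketch of greedy routing in Kleinberg's grid model (not the paper's code; for brevity the long-range contact of the current node is drawn on the fly with probability proportional to d(u,v)^-r instead of being fixed in advance, and all parameter values are illustrative):

import random

def lattice_distance(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def long_range_contact(u, n, r, rng):
    """Pick one long-range contact of u with Pr[v] proportional to d(u,v)^-r."""
    nodes = [(x, y) for x in range(n) for y in range(n) if (x, y) != u]
    weights = [lattice_distance(u, v) ** (-r) for v in nodes]
    return rng.choices(nodes, weights=weights, k=1)[0]

def greedy_route(n, r, source, target, seed=0):
    """Greedy routing: forward to the neighbour (grid or long-range)
    closest to the target in lattice distance; returns the hop count."""
    rng = random.Random(seed)
    u, hops = source, 0
    while u != target:
        x, y = u
        neigh = [(x + dx, y + dy) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                 if 0 <= x + dx < n and 0 <= y + dy < n]
        neigh.append(long_range_contact(u, n, r, rng))
        u = min(neigh, key=lambda v: lattice_distance(v, target))
        hops += 1
    return hops

print(greedy_route(n=30, r=2, source=(0, 0), target=(29, 29)))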
The Web Graph
Web graph
• Notation: G = (V, E) is given by
  – a set of vertices (nodes), denoted V
  – a set of edges (links) = pairs of nodes, denoted E
• The page graph (directed)
  – V = static web pages (4.2 B, the number of pages indexed by Google on March 5)
  – E = static hyperlinks (30 B?)
Why is it interesting to study the Web Graph?
• It is the largest artifact ever conceived by humankind
• Exploit the structure of the Web for
  – Crawl strategies
  – Search
  – Spam detection
  – Discovering communities on the web
  – Classification/organization
• Predict the evolution of the Web
  – Mathematical models
  – Sociological understanding
Many other web/internet related graphs
• Physical network graph
– V = Routers
– E = communication links
• The host graph (directed)
  – V = hosts
  – E = there is an edge from host A to host B if some page on host A links to a page on host B
• The "cosine" graph (undirected, weighted)
  – V = static web pages
  – E = cosine distance between the term vectors associated with pages
• Co-citation graph (undirected, weighted)
  – V = static web pages
  – E = the weight of (x,y) is the number of pages that refer to both x and y
• Communication graphs (which host talks to which host at a given time)
• Routing graph (how packets move)
• Etc.
Observing Web Graph
• It is a huge ever expanding graph
• We do not know which percentage of it we know
• The only way to discover the graph structure of the
web as hypertext is via large scale crawls
• Warning: the picture might be distorted by
– Size limitation of the crawl
– Crawling rules
– Perturbations of the "natural" process of birth and death of
nodes and links
Naïve solution
• Keep crawling; when you stop seeing new pages, stop
• Extremely simple, but wrong: crawling is complicated because the web is complicated
  – spamming
  – duplicates
  – mirrors
• First example of a complication: soft 404
  – When a page does not exist, the server is supposed to return an error code ("404")
  – Many servers do not return an error code, but keep the visitor on the site, or simply redirect to the home page
The Static Public Web
•Static
– not the result of cgi-bin scripts
– no “?” in the URL
– doesn’t change very often
– etc.
•Public
– no password required
– no robots.txt exclusion
– no “noindex” meta tag
– etc.
Static vs. dynamic pages
•“Legitimate static” pages built on the fly
– “includes”, headers, navigation bars, decompressed text, etc.
•“Dynamic pages” appear static
– browseable catalogs (Hierarchy built from DB)
•Huge amounts of catalog & search results pages
– Shall we count all the Amazon pages?
•Very low reliability servers
•Seldom connected servers
•Spider traps -- infinite url descent
– www.x.com/home/home/home/…./home/home.html
•Spammer games
In some sense, the “static” web is infinite
…
The static web is whatever pages can be
found in at least one major search engine.
Large scale crawls
• [KRRT99] Alexa crawl, 1997, 200M pages
  – Trawling the Web for emerging cyber-communities, R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins
• [BMRRSTW 2000] AltaVista crawls, 500M pages
  – Graph structure in the Web, A. Broder, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener
• [LLMMS03] WebBase Stanford project, 2001, 400M pages
  – Algorithms and Experiments for the Webgraph, L. Laura, S. Leonardi, S. Millozzi, U. Meyer, J. Sibeyn, et al.
Graph properties of the Web
– Global structure: how does the web look from far away?
– Connectivity: how many connections?
– Connected components: how large?
– Reachability: can one go from here to there? How many hops?
– Dense subgraphs: are there good clusters?
Basic definitions
•In-degree = number of incoming links
– www.yahoo.com has in-degree ~700,000
•Out-degree = number of outgoing links
– www.altavista.com has out-degree 198
•Distribution = #nodes with given in/out-degree
Power laws
• Inverse polynomial tail (for large k)
  – Pr(X = k) ~ 1/k^α
  – The graph of such a function is a line on a log-log scale: log Pr(X = k) = -α · log k
  – Very large values are possible and probable
• Exponential tail (for large k)
  – Pr(X = k) ~ e^-k
  – The graph of such a function is a line on a log scale: log Pr(X = k) = -k
  – No very large values
• Examples
Power laws
– Inverse polynomial tail: the distribution of wealth, the popularity of books, etc.
– Inverse exponential tail: the distribution of age, etc.
– Internet network topology [Faloutsos et al., 1999]
A sketch of fitting a power-law exponent on a log-log scale follows below.
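A small sketch of what "a line on a log-log scale" means in practice: synthetic samples with Pr(X = k) ~ k^-2.1 are generated and the slope of log(frequency) versus log(value) is fitted (the data, the exponent 2.1 and the crude least-squares fit are only illustrative):

import math
import random
from collections import Counter

rng = random.Random(0)

# Synthetic degrees with Pr[X = k] ~ k^-2.1 (inverse-transform sampling, then rounding)
alpha = 2.1
degrees = [int((1 - rng.random()) ** (-1 / (alpha - 1))) for _ in range(100_000)]

counts = Counter(degrees)
xs = [math.log(k) for k in counts if k >= 1]
ys = [math.log(counts[k]) for k in counts if k >= 1]

# Least-squares slope of log(frequency) vs log(degree); roughly -alpha
# (the naive fit is biased in the tail; binning or MLE would do better)
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
print(f"fitted slope ~ {slope:.2f}")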
Power laws on the web
– Web page access statistics [Glassman 97; Huberman et al. 98; Adamic & Huberman 99]
– Client-side access patterns [Barford et al.]
– Site popularity
– Statistical Physics: Signature of self-similar/fractal
structure
The In-degree distribution
(Figure: in-degree distributions for the Altavista crawl, 1999 [Broder et al.] and the WebBase crawl, 2001 [Laura et al., 2003])
The in-degree follows a power-law distribution: Pr[in-degree(u) = k] ∝ 1/k^2.1
Graph structure of the Web
(Figure: out-degree distributions, Altavista 1999 and WebBase 2001)
Does the out-degree follow a power-law distribution? Pages with a large number of outlinks are less frequent than a power law would predict.
What is good about 2.1
– Expected number of nodes with in-degree k: ~ n/k^2.1
– Expected contribution to the total number of edges: ~ (n/k^2.1) · k = n/k^1.1
– Summing over all k, the expected number of edges is ~ n
– This is good news: our computational power is such that we can deal with work linear in the size of the web, but not with much more than that
– The average number of incoming links is about 7!
A worked version of this calculation follows below.
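Spelled out, the calculation above reads as follows (the normalizing constant c is an assumption; only the exponents come from the slide):

\[
\mathbb{E}\bigl[\#\{v : \mathrm{indeg}(v)=k\}\bigr] \approx \frac{c\,n}{k^{2.1}},
\qquad
\frac{c\,n}{k^{2.1}}\cdot k = \frac{c\,n}{k^{1.1}},
\qquad
|E| \approx c\,n \sum_{k\ge 1} k^{-1.1} = \Theta(n),
\]
since the series \(\sum_{k\ge 1} k^{-1.1}\) converges.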
More definitions
• Weakly connected components (WCC)
  – A set of nodes such that one can go from any node to any other node via an undirected path
• Strongly connected components (SCC)
  – A set of nodes such that one can go from any node to any other node via a directed path
(Figure: an example of a WCC and of an SCC)
The Bow Tie [Broder et al., 2000]
(Figure: the Bow Tie structure of the Web, with the CORE, IN, OUT, TENDRILS, TUBES and DISC regions, annotated with example URLs such as .../~newbie/ and www.ibm.com/.../.../leaf.htm)
Experiments (1)
Region sizes (number of nodes) in the two crawls (figure: pie charts):
– Altavista '99: SCC 28%, IN 21%, OUT 21%, TENDRILS 22%, DISC. 8%
– WebBase '01: SCC 33% (44,713,193 nodes), IN 11% (14,441,156), OUT 39% (53,330,475), TENDRILS 13% (17,108,859), DISC. 4% (6,172,183)
– SCC sizes follow a power law: Pr[size of SCC(u) ≥ i] ∝ 1/i (see Experiments (2))
Experiments (2)
SCC distribution region by region
Power Law 2.07
Experiments (3)
Indegree distribution region by region
Power Law 2.1
Experiments (4)
Indegree distribution in the SCC graph
Power Law 2.1
Experiments (5)
WCC distribution in IN
Power Law 1.8
What did we learn from all these power laws?
• The second largest SCC has size less than 10,000 nodes!
• IN and OUT have millions of access points to the CORE and thousands of relatively large weakly connected components
• This may help in designing better crawling strategies, at least for IN and OUT, i.e. the load can be split between the robots without much overlap
• While a power law with exponent 2.1 is a universal feature of the Web, there is no fractal structure: IN and OUT do not show any Bow Tie phenomenon
Algorithmic issues
•Apply standard linear-time algorithms for WCC
and SCC
– Hard to do if you can’t store the entire graph in
memory!!
• WCC is easy if you can store V (semi-external algorithm)
• No one knows how to do DFS in semi-external memory, so SCC
is hard ??
• Might be able to do approx SCC, based on low diameter.
•Random sampling for connectivity information
– Find all nodes reachable from one given node
(“Breadth-First Search”)
– BFS is also hard. Simpler on low diameter
Find the CORE
• Iterate the following process:
  – Pick a random vertex v
  – Compute all nodes reachable from v: O(v)
  – Compute all nodes that reach v: I(v)
  – Compute SCC(v) := I(v) ∩ O(v)
  – Check whether it is the largest SCC found so far
• If the CORE contains about 1/4 of the vertices, after 20 iterations the probability of not finding the CORE is below 1% (since (3/4)^20 < 0.01).
A sketch of this sampling procedure follows below.
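A minimal in-memory sketch of the sampling procedure (the real computation is semi-external; here succ/pred are assumed to be dictionaries of successor and predecessor lists):

import random
from collections import deque

def reach(start, adj):
    """Set of nodes reachable from start following the given adjacency lists."""
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def find_core(succ, pred, nodes, trials=20, seed=0):
    """Sample random vertices; SCC(v) = O(v) ∩ I(v); keep the largest one found.
    If the CORE holds ~1/4 of the vertices, 20 trials miss it with pb (3/4)^20 < 1%."""
    rng = random.Random(seed)
    best = set()
    for _ in range(trials):
        v = rng.choice(nodes)
        scc = reach(v, succ) & reach(v, pred)   # O(v) ∩ I(v)
        if len(scc) > len(best):
            best = scc
    return best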
Find OUT
• Forward BFS from the SCC → OUT
Find IN
• Backward BFS from the SCC → IN
Find TENDRILS and TUBES
• Forward BFS from IN (avoiding the CORE) → TENDRILS_IN
• Backward BFS from OUT (avoiding the CORE) → TENDRILS_OUT
• TENDRILS_IN ∪ TENDRILS_OUT → TENDRILS
• TENDRILS_IN ∩ TENDRILS_OUT → TUBES
Find DISC
• DISCONNECTED: what is left.
A sketch of the whole classification follows below.
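Given the CORE, a compact (and deliberately inefficient) in-memory sketch of the whole classification, reusing the reach() helper from the previous sketch; succ/pred are the assumed successor/predecessor dictionaries:

def bow_tie(succ, pred, nodes, core):
    """Classify vertices into bow-tie regions, given the CORE (largest SCC)."""
    c = next(iter(core))
    out_ = reach(c, succ) - core            # reachable from the CORE
    in_ = reach(c, pred) - core             # can reach the CORE
    tendrils_in, tendrils_out = set(), set()
    for v in in_:                           # hanging off IN, never reaching the CORE
        tendrils_in |= reach(v, succ)
    for v in out_:                          # leading into OUT, never reached from the CORE
        tendrils_out |= reach(v, pred)
    tendrils_in -= core | in_ | out_
    tendrils_out -= core | in_ | out_
    tubes = tendrils_in & tendrils_out      # from IN to OUT, bypassing the CORE
    tendrils = (tendrils_in | tendrils_out) - tubes
    disc = set(nodes) - core - in_ - out_ - tendrils - tubes
    return {"IN": in_, "OUT": out_, "TENDRILS": tendrils, "TUBES": tubes, "DISC": disc}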
(2) Compute SCCs
• Classical algorithm (Kosaraju):
  – DFS(G)
  – Transpose G into G^T
  – DFS(G^T), visiting vertices in decreasing order of f[v] (the finishing time of the first visit)
  – Every tree of the second DFS forest is an SCC
• DFS is hard to compute on secondary memory: no locality
DFS
Classical Approach

main() {
  foreach vertex v do
    color[v] ← WHITE
  endFor
  foreach vertex v do
    if (color[v] == WHITE) then
      DFS(v)
  endFor
}

DFS(u: vertex) {
  color[u] ← GRAY
  d[u] ← time ← time + 1        // discovery time
  foreach v in succ[u] do
    if (color[v] == WHITE) then
      p[v] ← u
      DFS(v)
  endFor
  color[u] ← BLACK              // u is finished
  f[u] ← time ← time + 1        // finishing time
}
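For comparison, an in-memory, iterative Python version of the two-pass SCC algorithm (a sketch, not the semi-external variant discussed next; succ is assumed to be a dictionary of successor lists):

def strongly_connected_components(succ):
    """Kosaraju's two passes, written iteratively to avoid deep recursion."""
    # Pass 1: DFS on G, record vertices in order of finishing time.
    finished, visited = [], set()
    for root in succ:
        if root in visited:
            continue
        visited.add(root)
        stack = [(root, iter(succ.get(root, ())))]
        while stack:
            u, it = stack[-1]
            child = next((v for v in it if v not in visited), None)
            if child is None:
                finished.append(u)          # u turns "black": finishing time recorded
                stack.pop()
            else:
                visited.add(child)
                stack.append((child, iter(succ.get(child, ()))))
    # Build the transposed graph G^T.
    pred = {}
    for u, vs in succ.items():
        for v in vs:
            pred.setdefault(v, []).append(u)
    # Pass 2: DFS on G^T in decreasing finishing time; each tree is one SCC.
    sccs, seen = [], set()
    for u in reversed(finished):
        if u in seen:
            continue
        comp, stack = [], [u]
        seen.add(u)
        while stack:
            x = stack.pop()
            comp.append(x)
            for y in pred.get(x, ()):
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        sccs.append(comp)
    return sccs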
Semi-External DFS (1)
(J.F. Sibeyn, J. Abello, U. Meyer)
Memory space: about (12 + 1/8) bytes per vertex
Compute the DFS forest in several iterations, until there are no forward edges:
a forest is a DFS forest if and only if there are no forward edges.
Semi-External DFS (2)
Data structures
• Adjacency lists to store the partial DFS tree
  – n+1 integers point to the successor lists (n+1 pointers)
  – n+k integers store up to n+k adjacent vertices (k ≥ n)
(Figure: the array of n+1 pointers indexing into the array of n+k adjacent vertices)
Semi-External DFS (3)
While the DFS changes {
  – Add the next k edges to the current DFS
  – Compute the DFS on these n+k edges
  – Update the DFS
}
Computation of SCC
Is the Web a small world?
• Based on a simple model, [Barabási et al.] predicted that most pages are within 19 links of each other; they justified the model by crawling nd.edu (1999)
• Well, not really!
Distance measurements
• Experimental data (Altavista)
  – Maximum directed distance between 2 CORE nodes: 28
  – Maximum directed distance between 2 nodes, given there is a path: > 900
  – Average directed distance between 2 SCC nodes: 16
• Experimental data (WebBase)
  – Depth of IN = 8
  – Depth of OUT = 112
  – Max backward and forward BFS depth in the CORE = 8
More structure in the Web Graph
• Insight from hubs and authorities: a dense bipartite subgraph is the signature of a Web community
• A large number of bipartite cliques, the cores of hidden Web communities, can be found in the Webgraph [R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins, 99]
(Figure: a (4,3) bipartite clique, with hubs/fans on one side and authorities/centers on the other)
Disjoint bipartite cliques
(Figure: number of disjoint bipartite cliques found in the 200M-page Alexa crawl, 1997 [Kumar et al.] and in the 200M-page WebBase crawl, 2001)
More cyber-communities, and/or a better algorithm for finding disjoint bipartite cliques.
Approach
• Find all disjoint (i,j) cores, 3≤i≤10, 3≤j≤10
• Expand such cores into full communities
• Enumerating all dense bipartite subgraphs is
very expensive
• We run heuristics to approximately
enumerate disjoint bipartite cliques of small
size
Pre-processing
• Remove all vertices with |I(v)| > 50
  – We are not interested in popular pages such as Yahoo, CNN, etc.
• Use iterative pruning:
  – Remove all centers with |I(v)| < j
  – Remove all fans with |O(v)| < i
Enumeration of (i,j) cliques
1. For a vertex v, enumerate the size-j subsets S of O(v)
2. If |∩_{u∈S} I(u)| ≥ i, then an (i,j) clique has been found
3. Remove its i fans and j centers
4. Repeat until the graph is empty
A sketch of the pruning and enumeration steps follows below.
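A small in-memory sketch of the pruning and the enumeration steps (the paper's algorithm is semi-external and streams the graph; here I(v)/O(v) are the pred/succ dictionaries and the thresholds are the ones quoted on the slides):

from itertools import combinations

def prune(succ, pred, i, j, popular=50):
    """Drop very popular pages, then iteratively discard center candidates
    with fewer than j in-links and fan candidates with fewer than i out-links."""
    centers = {v for v in pred if len(pred[v]) <= popular}
    fans = set(succ)
    changed = True
    while changed:
        new_centers = {v for v in centers if sum(1 for u in pred[v] if u in fans) >= j}
        new_fans = {v for v in fans if sum(1 for w in succ[v] if w in centers) >= i}
        changed = (new_centers, new_fans) != (centers, fans)
        centers, fans = new_centers, new_fans
    return fans, centers

def disjoint_cores(succ, pred, i, j, fans, centers):
    """For each fan v, test size-j subsets S of O(v): if at least i fans point to
    every center in S, report an (i,j) core and remove its fans and centers."""
    found = []
    fans, centers = set(fans), set(centers)
    for v in sorted(fans):
        if v not in fans:
            continue
        out_v = sorted(c for c in succ.get(v, ()) if c in centers)
        for S in combinations(out_v, j):
            common = set.intersection(*[set(pred[c]) for c in S]) & fans
            if len(common) >= i:
                core_fans = sorted(common)[:i]
                found.append((core_fans, list(S)))
                fans -= set(core_fans)            # remove i fans and j centers
                centers -= set(S)
                break
    return found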
Semi-external algorithm
• The lists of predecessors and successors are stored in N/B blocks
• Every block contains the adjacency lists of B vertices and fits in main memory
• Keep two 1-bit arrays Fan() and Center() in main memory
• Phases I and II are easily implemented in streaming fashion
• Phase III, problem: given S, computing ∩_{u∈S} I(u) needs access to later blocks
Semi-external algorithm
Phase III. If we cannot decide on a set S within the current block, store S ⊆ O(v) and the partial intersection ∩_{u∈S} I(u) with the next block containing a vertex of S.
When a new block is moved to main memory, explore all vertices in the block and continue the exploration of the sets S inherited from previous blocks.
Computation of disjoint (4,4) cliques
PageRank
• PageRank measures the steady-state visit rate of a random walk on the Web Graph
• Imagine a random surfer:
  – Start from a random page
  – At any time, choose an outgoing link with equal probability
(Figure: a page with three outgoing links, each followed with probability 1/3)
Problem
The Web is full of dead ends
• Teleporting
  – With probability α, choose a random outgoing edge
  – With probability 1-α, continue from a random node
• There is a long-term visit rate for every page
Page Rank Computation
• Let a be the PageRank vector and A the transition matrix of the random walk (the row-normalized adjacency matrix)
• If we start from distribution a, after one step we are at distribution aA
• PageRank is the left eigenvector of A: a = aA
• In practice, start from any vector a and repeatedly apply
  a = aA, a = aA^2, ..., a = aA^k
  until a is stable
A minimal power-iteration sketch follows below.
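A minimal in-memory power-iteration sketch with teleportation (α is the probability of following a link, as above; dangling pages redistribute their mass uniformly; this is a toy version, not Haveliwala's external-memory algorithm):

def pagerank(succ, alpha=0.85, iters=50):
    """Power iteration: repeatedly push rank along out-links, with teleportation."""
    nodes = list(set(succ) | {v for vs in succ.values() for v in vs})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        nxt = {v: 0.0 for v in nodes}
        dangling = 0.0
        for u in nodes:
            out = succ.get(u, ())
            if out:
                share = alpha * rank[u] / len(out)
                for v in out:
                    nxt[v] += share
            else:
                dangling += alpha * rank[u]        # dead end: collect its mass
        leak = (1.0 - alpha) + dangling            # teleportation + dangling mass
        for v in nodes:
            nxt[v] += leak / n
        rank = nxt
    return rank

# tiny example graph (hypothetical page ids)
print(pagerank({1: [2, 3], 2: [3], 3: [1], 4: [3]}))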
Pagerank distribution
• Pagerank is distributed with a power law with exponent 2.1 [Pandurangan, Raghavan, Upfal, 2002], measured on a sample of 100,000 vertices from brown.edu
• In-degree and Pagerank are not correlated!!
• We compute Pagerank on the WebBase crawl:
  – Confirm that Pagerank is distributed with a power law with exponent 2.1
  – Pagerank/In-degree correlation = 0.3
  – Efficient external-memory computation of Pagerank based on [T.H. Haveliwala, 1999]
Efficient computation of Pagerank
Models of the Web Graph
Why study models for the Web Graph?
• To better understand the process of content creation
• To test Web applications
• To predict its evolution
…and it is a challenging mathematical problem
Standard Theory of Random Graphs (Erdős and Rényi, 1960)
• Random graphs are constructed by starting with n vertices
• Every pair of vertices is connected by an edge with probability p
• Degrees are Poisson distributed:
  P(k) = e^(-pN) (pN)^k / k!
Properties of Random Graphs
• The probability that the degree is far from its expectation c = pn drops exponentially fast
• Threshold phenomena:
  – If c < 1, the largest component has size O(log n)
  – If c > 1, the largest component has size Θ(n), and all others have size O(log n)
• If c = ω(log n), the graph is connected and the diameter is O(log n / log log n)
• We look for random graph models that reproduce the properties of the Web
Features of a good model
• Must evolve over time: pages appear and disappear over time
• Content creation is sometimes independent of, and sometimes dependent on, the current Web: some links are random, others are copied from existing links
• Content creators are biased towards good pages
• Must be able to reproduce the relevant observables, i.e. statistical and topological structures
Details of the Models
• Vertices are created in discrete time steps
• At every time step a new vertex is created, which connects with d edges to existing vertices
• d = 7 in the simulations
• Edges are selected according to specific probability distributions
Evolving Network [Albert, Barabási, 1999]
• Growing network with preferential attachment:
  1. Growth: at every time step a new node enters the system with d edges
  2. Preferential attachment: the probability of connecting to a node is proportional to its degree, P(k) ∝ k
• The resulting degree distribution is a power law: P(k) ~ k^-α, α = 2
A sketch of the process follows below.
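A minimal sketch of the growth process (not the authors' code; the seed graph and the way ties are broken are illustrative choices):

import random

def preferential_attachment(n, d=7, seed=0):
    """Barabasi-Albert style growth: each new node attaches d edges to existing
    nodes chosen with probability proportional to their current degree."""
    rng = random.Random(seed)
    urn = list(range(d))              # each occurrence of a node = one unit of degree
    edges = []
    for v in range(d, n):
        chosen = set()
        while len(chosen) < d:        # d distinct, degree-biased targets
            chosen.add(rng.choice(urn))
        for u in chosen:
            edges.append((v, u))      # directed edge v -> u, useful for in-degree stats
            urn += [v, u]             # both endpoints gain one degree unit
    return edges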
A formal argument [Bollobás, Riordan, Spencer, Tusnády, 2001]
• The fraction of nodes that have in-degree k is proportional to k^-3 if d = 1
• Proof idea: consider 2n nodes and take a random pairing; start from the left side and identify a vertex with all consecutive left endpoints, until a right endpoint is reached
• Preferential attachment fails to capture other aspects of the Web graph, such as the large number of small bipartite cliques
The Copying Model [Kumar, Raghavan, Rajagopalan, Sivakumar, Tomkins, Upfal, 2000]
• It is an evolving model: vertices are added one by one, and each points with d = 7 edges to existing vertices
  – When inserting vertex v, choose at random a prototype vertex u in the graph
  – With probability α copy the j-th link of u, with probability 1-α choose a random endpoint
  – The in-degree follows a power law with exponent 2.1 if α = 0.8
• It aims to model the process of copying links from other related pages
• It tries to form Web communities whose hubs point to good authorities
A sketch of the copying process follows below.
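A minimal sketch of the process, following the convention on the slide (copy with probability α, random endpoint otherwise); the seed graph on the first d vertices is an arbitrary choice:

import random

def copying_model(n, d=7, alpha=0.8, seed=0):
    """Each new vertex v picks a random prototype u; its j-th out-link copies
    u's j-th out-link with probability alpha, otherwise it points to a
    uniformly random existing vertex."""
    rng = random.Random(seed)
    out = {v: [rng.randrange(d) for _ in range(d)] for v in range(d)}  # arbitrary seed graph
    for v in range(d, n):
        u = rng.randrange(v)                        # prototype vertex
        links = []
        for j in range(d):
            if rng.random() < alpha:
                links.append(out[u][j])             # copy the j-th link of u
            else:
                links.append(rng.randrange(v))      # uniformly random endpoint
        out[v] = links
    return out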
Properties of the Copying model
• Let N_{t,k} be the number of nodes of degree k at time t
• lim_{t→∞} N_{t,k}/t ~ k^(-(2-α)/(1-α))
• Let Q(i,j,t) be the expected number of cliques K(i,j) at time t
• Theorem: Q(i,j,t) is Ω(t) for small values of i and j
A Multi-Layer model of the Graph
• Web produced by the superposition of multiple
independent regions, Thematically Unifying Clusters
[Dill et al, Self Similarity in the Web, 2002]
• Regions differ in size and in aggregation criteria, for instance topic, geography or domain.
• Regions are connected together by a ``connectivity
backbone'' formed by pages that are part of multiple
regions, e.g. documents that are relevant for multiple
topics
Multi-layer model
[Caldarelli, De Los Rios, Laura, Leonardi, Millozzi
2002]
• Model the Web as the
superposition of
different thematically
unified clusters,
generated by
independent stochastic
processes
• Different regions may
follow different
stochastic models of
aggregation
Multi-layer model, details
• Every new vertex is assigned to a constant number c = 3 of layers, chosen at random out of L = 100 layers
• Every vertex is connected with d = 7 edges, distributed over its c layers
• Within every layer, edges are inserted using the Copying model with α = 0.8 or the Evolving Network model
• The final graph is obtained by merging the graphs created in all the layers
A sketch of the layered construction follows below.
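A compact sketch of the layered construction; for brevity the per-layer rule below is a plain uniform choice among the layer's members, where the paper plugs in the Copying or Evolving Network model:

import random

def multilayer(n, L=100, c=3, d=7, seed=0, layer_rule=None):
    """Assign each vertex to c random layers, spread its d out-links over those
    layers, and pick the endpoint inside the chosen layer with layer_rule."""
    rng = random.Random(seed)
    layer_rule = layer_rule or (lambda members, v, rng: rng.choice(members))
    layers = {l: [] for l in range(L)}
    edges = []
    for v in range(n):
        mine = rng.sample(range(L), c)             # the c layers of vertex v
        for l in mine:
            layers[l].append(v)
        for j in range(d):                         # spread the d edges over the c layers
            l = mine[j % c]
            members = [u for u in layers[l] if u != v]
            if members:
                edges.append((v, layer_rule(members, v, rng)))
    return edges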
Properties of the Multi-layer model
• The in-degree distribution follows a power law with exponent 2.1
• The result is stable under large variations of the parameters:
  – the total number of layers L
  – the number of layers to which every page is assigned
  – the stochastic model used within a single layer
What about SCCs?
• All the models presented so far produce directed graphs without cycles
• Rewiring the Copying model and the Evolving Network model:
  – Generate a graph according to the model on N vertices
  – Insert a number of random edges, from 0.01 N up to 3N
• In classical random graphs, the giant connected component suddenly emerges at c = 1
Size of the largest SCC / number of SCCs
• We observe the number of SCCs of size 1 and the size of the largest SCC
• Both measures show a smooth transition as the number of rewired edges increases
• No threshold phenomenon on SCCs for power-law graphs
• A similar result was recently proved by Bollobás and Riordan for WCCs in undirected graphs
Copying model with rewiring
Evolving Network model with rewiring
Efficient computation of SCC
Conclusions on Models for the Web
• In-degree: all models give a power-law distribution for specific parameters; the Multi-layer model achieves exponent 2.1 for a broad range of parameters
• Bipartite cliques: the Copying model manages to form a large number of bipartite cliques
• All models have a high correlation between PageRank and in-degree
• No model replicates the Bow Tie structure and the out-degree distribution
The Internet Graph
The map of the Internet
Burch, Cheswick [1999]
Faloutsos, Faloutsos, Faloutsos [1999]
(Figure: plot of the frequency of the out-degree in the Autonomous System graph)
How many vantage points do we need in order to sample most of the connections between Autonomous Systems?
The Web and the Internet are different!
• A model for a physical network should take geography into account.
• Heuristically Optimized Trade-offs (Carlson, Doyle).
• The power law is the outcome of human activity, i.e. of a compromise between different contrasting objectives.
• Network growth is a compromise between cost and centrality, i.e. between distance and good positioning in the network.
The FKP model [Fabrikant, Koutsoupias, Papadimitriou, 2002]
• Vertices arrive one by one, uniformly at random in the unit square
• Vertex i connects to a previous vertex j < i (the result is a tree), where
  – d(i,j): distance between i and j
  – h_j: measure of the centrality of node j, e.g.
    • average distance to the other nodes, or
    • average number of hops to the other vertices
• Vertex i chooses the j that minimizes α·d(i,j) + h_j, where α depends on n
A sketch of the construction follows below.
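A small sketch of the construction, taking h_j = number of hops to the root (one of the centrality measures mentioned above); α is the trade-off parameter:

import math
import random

def fkp_tree(n, alpha, seed=0):
    """Each new point i (uniform in the unit square) attaches to the previous
    point j minimizing alpha * d(i, j) + h_j, where h_j = hop distance to the root."""
    rng = random.Random(seed)
    pts = [(rng.random(), rng.random()) for _ in range(n)]
    parent = {0: None}
    hops = {0: 0}                              # centrality: hops to the root
    for i in range(1, n):
        xi, yi = pts[i]
        best, best_cost = None, float("inf")
        for j in range(i):
            xj, yj = pts[j]
            cost = alpha * math.hypot(xi - xj, yi - yj) + hops[j]
            if cost < best_cost:
                best, best_cost = j, cost
        parent[i] = best
        hops[i] = hops[best] + 1
    return parent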
Results
• The in-degree distribution depends on α:
  – α < 1/√N: the tree is a star
  – α = Ω(√N): the degree is exponentially distributed
  – 4 < α = o(√N): the degree follows a power-law distribution
Challenges
• Exploit the knowledge of the Web graph to design better crawling strategies.
• Design models for the dynamically evolving Web: e.g. model the rate of arrival of new connections over time.
• On-line algorithms with sub-linear space to maintain topological and statistical information.
• Data structures able to answer queries with time arguments.
Web Graph representation and
compression
Thanks to Luciana Salete Buriol
and Debora Donato
Main features of Web Graphs
Locality: usually most of the hyperlinks are local, i.e. they point to other URLs on the same host. The literature reports that on average 80% of the hyperlinks are local.
Consecutivity: links within the same page are likely to be consecutive with respect to the lexicographic order.
URL normalization: convert hostnames to lower case, canonicalize port numbers (re-introducing them where needed), and add a trailing slash to all URLs that do not have one.
Main features of WebGraphs
Similarity: Pages on the same host tend to have
many hyperlinks pointing to the same pages.
Consecutivity is the dual, distance-one similarity.
Literature
Connectivity Server (1998) – Digital Systems Research Center and Stanford University – K. Bharat, A. Broder, M. Henzinger, P. Kumar, S. Venkatasubramanian;
Link Database (2001) – Compaq Systems Research Center – K. Randall, R. Stata, R. Wickremesinghe, J. Wiener;
WebGraph Framework (2002) – Università degli Studi di Milano – P. Boldi, S. Vigna.
Connectivity Server
➢ Tool for web graph visualisation, analysis (connectivity, ranking of pages) and URL compression;
➢ Used by AltaVista;
➢ Links are represented by an outgoing and an incoming adjacency list;
➢ Composed of:
  URL Database: URL, fingerprint, URL-id;
  Host Database: groups of URLs based on the hostname portion;
  Link Database: URL, outlinks, inlinks.
Connectivity Server: URL compression
URLs are sorted lexicographically and stored as delta-encoded entries (70% reduction).
(Figure: the delta encoding of the URLs and the index into the delta encoding)
Link1: first version of the Link Database
No compression: simple representation of the outgoing and incoming adjacency lists of links.
Avg. inlink size: 34 bits
Avg. outlink size: 24 bits

Link2: second version of the Link Database
Compression of the single lists and of the starts array.
Avg. inlink size: 8.9 bits
Avg. outlink size: 11.03 bits
Delta Encoding of the Adjacency Lists
Each array element is 32 bits long.
The first item of a list is stored as the difference from the source URL-id, later items as the difference from the previous entry, e.g.:
-3 = 101 - 104 (first item)
42 = 174 - 132 (other items)
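A tiny sketch of this encoding (the intermediate successor 132 is taken from the slide's example; the full list is illustrative):

def delta_encode(src, successors):
    """First entry: difference from the source id; later entries: difference
    from the previous successor (lists are kept sorted, so these gaps are small)."""
    out, prev = [], src
    for s in successors:
        out.append(s - prev)
        prev = s
    return out

def delta_decode(src, deltas):
    vals, prev = [], src
    for d in deltas:
        prev += d
        vals.append(prev)
    return vals

# e.g. node 104 with successors [101, 132, 174] -> [-3, 31, 42]
print(delta_encode(104, [101, 132, 174]))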
Nybble Code
The low-order bit of each nybble indicates whether or not
there are more nybbles in the string
The least-significant data bit encodes the sign.
The remaining bits provide an unsigned number
28 = 0111 1000
-28 = 1111 0010
Starts array compression
• The URLs are divided into three partitions based on their degree;
• Elements of the starts array are indices into the nybble-coded lists;
• The literature reports that 74% of the entries are in the low-degree partition.

Starts array compression
Entry range | Partition | # bits per entry
Z(x) > 254 | high-degree partition | 32
254 ≥ Z(x) ≥ 24 | medium-degree partition | (32 + P*16)/P
Z(x) < 24 | low-degree partition | (32 + P*8)/P
Z(x) = max(indegree(x), outdegree(x))
P = the number of pages in each block.
Link3: third version of the Link Database
Interlist compression with a representative list.
Avg. inlink size: 5.66 bits
Avg. outlink size: 5.61 bits

Interlist Compression
ref: relative index of the representative adjacency list;
deletes: set of URL-ids to delete from the representative list;
adds: set of URL-ids to add to the representative list.
LimitSelect-K-L: chooses the best representative adjacency list from among the previous K (8) URL-ids' adjacency lists and only allows chains of fewer than L (4) hops.
ζ-codes (WebGraph Framework)
Interlist compression with a representative list.
Avg. inlink size: 3.08 bits
Avg. outlink size: 2.89 bits
Compressing Gaps
(Figure: an uncompressed adjacency list and the same list with compressed gaps)
Successor list S(x) = {s1 - x, s2 - s1 - 1, ..., sk - s(k-1) - 1}
For negative entries (only the first gap can be negative), an integer v is mapped to 2v if v ≥ 0 and to 2|v| - 1 if v < 0.
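A tiny sketch of the gap computation, with the mapping for the (possibly negative) first gap; the example list is illustrative:

def to_natural(v):
    """Map an integer to a natural number: 2v if v >= 0, 2|v| - 1 if v < 0
    (only the first gap, which may be negative, needs this)."""
    return 2 * v if v >= 0 else 2 * (-v) - 1

def encode_gaps(x, successors):
    """WebGraph-style gaps: first gap s1 - x, then s_i - s_(i-1) - 1."""
    gaps = []
    for i, s in enumerate(successors):
        gaps.append(s - x if i == 0 else s - successors[i - 1] - 1)
    return [to_natural(gaps[0])] + gaps[1:] if gaps else []

# e.g. x = 15 with successors [13, 16, 17, 22] -> [3, 2, 0, 4]
print(encode_gaps(15, [13, 16, 17, 22]))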
Using copy lists
(Figure: an uncompressed adjacency list and the same list with copy lists)
Each bit of the copy list tells whether the corresponding successor of y is also a successor of x;
The reference list index ref is chosen as the value between 0 and W (window size) that gives the best compression.
Using copy blocks
(Figure: the adjacency list with copy lists and the same list with copy blocks)
The last block is omitted;
The first copy block is 0 if the copy list starts with 0;
The length is decremented by one for all blocks except the first one.
Compressing intervals
(Figure: the adjacency list with copy lists and the same list with intervals)
Intervals: represented by their left extreme and their length;
Interval lengths: decremented by the threshold L_min;
Residuals: compressed using differences.
Compressing intervals
(Figure: the adjacency list with copy lists and the same list with intervals; worked examples:)
0 = (15 - 15) * 2
600 = (316 - 16) * 2
5 = |13 - 15| * 2 - 1
3018 = 3041 - 22 - 1
50 = ?
Compression comparison

Method | Inlink size (bits) | Outlink size (bits) | Access time | # pages (million) | # links (million) | Database
Huff. | 15.2 | 15.4 | 112 | ? | 320 | WebBase
Link1 | 34 | 24 | 13 | 61 | 1000 | Web Crawler Mercator
Link2 | 8.9 | 11.03 | 47 | 61 | 1000 | Web Crawler Mercator
Link3 | 5.66 | 5.61 | 248 | 61 | 1000 | Web Crawler Mercator
ζ-codes | 3.25 | 2.18 | 206 | 18.5 | 300 | .uk domain
s-Node | 5.07 | 5.63 | 298 | ? | 900 | WebBase

Using different computers and compilers.
Conclusions
The compression techniques are specialized for Web graphs.
The average link size decreases as the graph grows.
The average link access time increases as the graph grows.
The ζ-codes seem to have the best trade-off between average bit size and access time.
Download