CS728 Lecture 16
Web indexes II

Last Time
• Indexes for answering text queries
  – given a term, produce all URLs containing it
  – Compact representations
    • for postings: used gap (Elias) encodings
    • for dictionary: used pointers into a string of terms

Today’s lecture
• Indexes for connectivity testing
• Distance and Transitive Closure
• Data Structure: 2-hop covers

Connectivity Server
• Support for fast queries on the web graph
  – Which URLs point to a given URL?
  – Which URLs does a given URL point to?
• Stores mappings in memory from
  – URL to outlinks, URL to inlinks
• Applications
  – Crawl control, Web graph analysis
    • Connectivity, crawl optimization
  – Link analysis

Adjacency lists
• The set of neighbors of a node
• Assume each URL is represented by an integer
• E.g., for a 4 billion page web, need 32 bits per node
• Naively, this demands 64 bits to represent each hyperlink

Adjacency list compression
• Properties exploited in compression:
  – Similarity (between lists)
  – Locality (many links from a page go to “nearby” pages)
  – Use gap encodings in sorted lists
  – Distribution of gap values

Storage
• A recent paper by Boldi/Vigna reports getting down to an average of ~3 bits/link
  – (URL to URL edge)
  – For a 118M node web graph
• Why is this remarkable? The naive representation needs 64 bits per link.
• How?
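One ingredient from the compression slide, gap encoding of a sorted adjacency list, can be sketched as follows. This is only an illustration under simplifying assumptions: node ids are positive integers, the first gap is taken from 0, and gaps are written with an Elias gamma code. The function names and the sample list are made up for the example, and this is not the actual Boldi/Vigna format.

```python
def to_gaps(sorted_ids):
    """Replace each id in a strictly increasing list by its gap to the previous id.

    The first entry is its gap from 0, which assumes ids start at 1.
    """
    gaps, prev = [], 0
    for node_id in sorted_ids:
        if node_id <= prev:
            raise ValueError("ids must be positive and strictly increasing")
        gaps.append(node_id - prev)
        prev = node_id
    return gaps


def from_gaps(gaps):
    """Invert to_gaps by taking running sums."""
    ids, running = [], 0
    for gap in gaps:
        running += gap
        ids.append(running)
    return ids


def elias_gamma(n):
    """Elias gamma code of a positive integer, returned as a bit string.

    The code is (len - 1) zeros followed by the binary form of n,
    where len is the number of binary digits of n.
    """
    if n < 1:
        raise ValueError("Elias gamma is defined for positive integers only")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def encoded_bits(sorted_ids):
    """Total number of bits used when every gap is gamma-coded."""
    return sum(len(elias_gamma(g)) for g in to_gaps(sorted_ids))


# Hypothetical outlink list of one page: most targets are "nearby" ids.
outlinks = [1000, 1002, 1003, 1010, 1500]
print(to_gaps(outlinks))
print(encoded_bits(outlinks), "bits versus", 32 * len(outlinks), "bits for raw 32-bit ids")
```

For this made-up list the gaps are 1000, 2, 1, 7, 490, and the gamma-coded list takes 45 bits instead of 160 bits for five raw 32-bit target ids. Small gaps, which locality makes common, are what keep the codes short.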
Main ideas of Boldi/Vigna
• Consider the lexicographically ordered list of all URLs, e.g.,
  – www.stanford.edu/alchemy
  – www.stanford.edu/biology
  – www.stanford.edu/biology/plant
  – www.stanford.edu/biology/plant/copyright
  – www.stanford.edu/biology/plant/people
  – www.stanford.edu/chemistry

Boldi/Vigna
• Each of these URLs has an adjacency list
• Main thesis: because of templates, the adjacency list of a node is similar to that of one of the 7 preceding URLs in the lexicographic ordering
• Express the adjacency list in terms of one of these
• E.g., consider these adjacency lists
  – 1, 2, 4, 8, 16, 32, 64
  – 1, 4, 9, 16, 25, 36, 49, 64
  – 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144
  – 1, 4, 8, 16, 25, 36, 49, 64
• The fourth list is the second list with 9 replaced by 8, so it can be stored as a reference two lists back plus that one change

Connectivity Queries
• Beyond adjacency we’d like to answer
  – Transitive closure: is there a path from x to y?
  – Distance: what is the length of the shortest path from x to y?
• Applications
  – Link analysis
  – XML path queries with wildcards

Naïve Solutions
• Given the graph, precompute
  – Compute and store all-pairs shortest paths (APSP)
  – Answer any query in constant time
  – Space requirements?
• OR online
  – Given a query, compute single-source shortest paths (SSSP)
  – No additional space
  – Time to answer query?
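The two naive options can be compared with a small sketch. It assumes an unweighted graph given as a dict of outlink lists, so breadth-first search yields shortest paths; the names `bfs_distances` and `all_pairs` and the toy graph are illustrative only.

```python
from collections import deque


def bfs_distances(adj, source):
    """Online option: one BFS per query, O(n + m) time, no stored index.

    Returns a dict mapping every node reachable from source to its distance.
    """
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def all_pairs(adj):
    """Precomputed option: one BFS per node, stored as a table of tables.

    Lookups are then constant time, but the table holds one entry per
    connected pair, which can grow to n * n entries.
    """
    nodes = set(adj) | {v for targets in adj.values() for v in targets}
    return {u: bfs_distances(adj, u) for u in nodes}


# Toy graph (hypothetical): 1 -> 2 -> 4 -> 5 and 1 -> 3 -> 4.
adj = {1: [2, 3], 2: [4], 3: [4], 4: [5], 5: []}

table = all_pairs(adj)
print(table[1].get(5))        # distance lookup in the stored table
print(5 in table[4])          # reachability lookup
print(sum(len(row) for row in table.values()), "stored entries")
```

This makes the trade-off on the slide concrete: the stored table answers queries by lookup but its size follows the transitive closure, while the online search stores nothing and pays a full traversal per query.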
Encoding Problem
Find a compact representation for the transitive closure
• whose size is comparable to the data's size
• that supports connection tests (almost) as fast as the naive transitive closure lookup
• that can be built efficiently for large data sets

Main Idea: 2-Hop Covers and 2-Hop Labeling
• A 2-hop cover is a set of hops (x,y) such that every connected pair is covered by 2 hops
• For each node a, maintain two sets of labels (which are nodes): Lin(a) and Lout(a)
• For each connection (a,b),
  – choose a node c on the path from a to b (center node)
  – add c to Lout(a) and to Lin(b)
• Then (a,b) ∈ transitive closure T ⟺ Lout(a) ∩ Lin(b) ≠ ∅
[Figure: path a → c → b through center node c]
Source: Reachability and distance queries via 2-hop labels (Cohen et al., SODA 2002)

2-hop Covers
• Conjecture: 2-hop covers of size O(n √m) always exist
• Goal: minimize the sum of the label sizes
• Problem is NP-hard
  – => approximation required
• Theorem: there exists a polynomial-time algorithm that approximates the optimum within a factor of O(log n)
• Greedy (set cover) algorithm

Approximation Algorithm
Initial step: all connections are uncovered
• What are good center nodes? Nodes that can cover many uncovered connections.
• Consider the center graph of each candidate, with an input side I and an output side O
[Figure: example graph on nodes 1 to 6 with the center graphs of two candidate center nodes]
• First candidate: 8 edges, |I| + |O| = 2 + 4 = 6, initial density 8/6 ≈ 1.33
  – density of the densest subgraph is the same as the initial density here
  – we can cover 8 connections with 6 cover entries
• Second candidate: 12 edges, |I| + |O| = 4 + 3 = 7, initial density 12/7 ≈ 1.71
  – density of the densest subgraph = initial density (the center graph is complete)
• Cover the connections in the subgraph with greatest density, using the corresponding center node

Approximation Algorithm
Next step: some connections are already covered
• Consider the center graphs of the candidates again
[Figure: center graph of a candidate after some connections have been covered]
• Repeat this algorithm until all connections are covered
Theorem: The generated cover is optimal up to a logarithmic factor

Experimental Results
Small example from the real world: subset of DBLP
• 6,210 documents (publications)
• 168,991 elements
• 25,368 links (citations)
• 14 Megabytes (uncompressed XML)
Element-level graph has 168,991 nodes and 188,149 edges
Its transitive closure: 344,992,370 connections, 2,632.1 MB

Experimental Results (cont.)
For the example above:
• Transitive Closure: 344,992,370 connections
• Two-Hop Cover: 1,289,930 entries
  – compression factor of ~267
  – queries are still fast (~7.6 entries/node)
But: computation took 45 hours and 80 GB RAM!

Why Distances are Difficult
• Should be simple to add: store the distance to the center with each label entry
[Figure: path v → u → w with dist(v,u) = 2 and dist(u,w) = 4]
  – Lout(v)={u, …} and Lin(w)={u, …} become
  – Lout(v)={(u,2), …} and Lin(w)={(u,4), …}
  – dist(v,w)=dist(v,u)+dist(u,w)=2+4=6
• Is this correct?

Why Distances are Difficult (cont.)
[Figure: the same nodes v, u, w, now with a direct connection from v to w, so dist(v,w)=1]
• Center node u does not reflect the correct distance between v and w

Solution: Distance-aware Centergraph
• Add edges to the center graph only if the corresponding connection is a shortest path
[Figure: example graph on nodes 1 to 6 and the distance-aware center graph of a candidate center node]
• Correct, but there are problems:
  – Expensive to build the center graph (2 additional lookups per connection)
  – Approximation bound is no longer tight

Enhancements
• Allow approximate distances for more compact representations
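The way 2-hop labels are queried, and the distance pitfall from the v, u, w example, can be sketched as below. The label contents are hand-written to mirror that example rather than produced by the greedy algorithm, the function names are hypothetical, and letting v act as its own center at distance 0 is an assumption of this sketch.

```python
def reachable(l_out, l_in, a, b):
    """a reaches b iff Lout(a) and Lin(b) share at least one center node."""
    return not l_out.get(a, {}).keys().isdisjoint(l_in.get(b, {}).keys())


def distance(l_out, l_in, a, b):
    """Smallest dist(a,c) + dist(c,b) over shared centers c, or None if none.

    The result is a true shortest-path distance only if some shared center
    lies on a shortest path from a to b.
    """
    out_a, in_b = l_out.get(a, {}), l_in.get(b, {})
    common = out_a.keys() & in_b.keys()
    if not common:
        return None
    return min(out_a[c] + in_b[c] for c in common)


# Labels as on the slide: u is the only shared center, but it is not on
# a shortest path from v to w.
naive_out = {"v": {"u": 2}}
naive_in = {"w": {"u": 4}}

# Distance-aware labels (assumption of this sketch): v also serves as a
# center, stored in Lout(v) at distance 0 and in Lin(w) at distance 1.
aware_out = {"v": {"u": 2, "v": 0}}
aware_in = {"w": {"u": 4, "v": 1}}

print(reachable(naive_out, naive_in, "v", "w"))
print(distance(naive_out, naive_in, "v", "w"))
print(distance(aware_out, aware_in, "v", "w"))
```

With the first pair of labels the reachability test succeeds but the distance comes out as 6; with the second pair the query returns 1, because a center on the shortest path is now shared. Both queries only touch the two label sets involved, which is why small labels (about 7.6 entries per node in the DBLP experiment) keep queries fast.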