Lecture 16

CS728 Lecture 16: Web indexes II

Last Time
• Indexes for answering text queries
  – Given a term, produce all URLs containing it
  – Compact representations
    • for postings: gap (Elias) encodings
    • for the dictionary: pointers into a string of terms

Today's lecture
• Indexes for connectivity testing
• Distance and transitive closure
• Data structure: 2-hop covers

Connectivity Server
• Support for fast queries on the web graph
  – Which URLs point to a given URL?
  – Which URLs does a given URL point to?
• Stores mappings in memory from
  – URL to outlinks, URL to inlinks
• Applications
  – Crawl control
  – Web graph analysis
    • Connectivity, crawl optimization
  – Link analysis

Adjacency lists
• The set of neighbors of a node
• Assume each URL is represented by an integer
• E.g., for a 4 billion page web, we need 32 bits per node
• Naively, this demands 64 bits to represent each hyperlink

Adjacency list compression
• Properties exploited in compression:
  – Similarity (between lists)
  – Locality (many links from a page go to "nearby" pages)
  – Use gap encodings in sorted lists
  – Distribution of gap values

Storage
• A recent paper by Boldi and Vigna reports getting down to an average of ~3 bits/link
  – (per URL-to-URL edge)
  – For a 118M-node web graph
• Why is this remarkable? (Compare with the naive 64 bits per link.)
• How?
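The gap-encoding idea for sorted adjacency lists can be sketched as follows. This is a minimal illustration, assuming node ids are positive integers and each list is strictly increasing; it uses the Elias gamma code mentioned for postings and is not the actual Boldi/Vigna scheme. The function names and the sample list are made up for the example.

```python
def gaps(sorted_ids):
    """Replace a strictly increasing id list by its first id and successive gaps."""
    out, prev = [], 0
    for x in sorted_ids:
        out.append(x - prev)
        prev = x
    return out


def elias_gamma(n):
    """Elias gamma code of a positive integer, returned as a bit string."""
    if n < 1:
        raise ValueError("Elias gamma is defined for positive integers only")
    binary = bin(n)[2:]
    # (len - 1) zeros announce the length, then the binary digits follow.
    return "0" * (len(binary) - 1) + binary


def encode(sorted_ids):
    """Gap-encode a sorted adjacency list into one bit string."""
    return "".join(elias_gamma(g) for g in gaps(sorted_ids))


def decode(bits):
    """Inverse of encode(); assumes a well-formed bit string."""
    ids, i, prev = [], 0, 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        gap = int(bits[i:i + zeros + 1], 2)
        i += zeros + 1
        prev += gap
        ids.append(prev)
    return ids


# Hypothetical adjacency list showing locality: most targets are "nearby".
adjacency = [1001, 1002, 1005, 1013]
code = encode(adjacency)   # gaps are [1001, 1, 3, 8]
print(len(code), "bits instead of", 32 * len(adjacency))
```

Small gaps get short codes, so lists with strong locality shrink the most; the first entry (a large absolute id) still costs many bits, which is one reason real schemes also exploit similarity between lists.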
Main ideas of Boldi/Vigna
• Consider the lexicographically ordered list of all URLs, e.g.,
  – www.stanford.edu/alchemy
  – www.stanford.edu/biology
  – www.stanford.edu/biology/plant
  – www.stanford.edu/biology/plant/copyright
  – www.stanford.edu/biology/plant/people
  – www.stanford.edu/chemistry

Boldi/Vigna
• Each of these URLs has an adjacency list
• Main thesis: because of templates, the adjacency list of a node is similar to that of one of the 7 preceding URLs in the lexicographic ordering
• Express the adjacency list in terms of one of these
• E.g., consider these adjacency lists
  – 1, 2, 4, 8, 16, 32, 64
  – 1, 4, 9, 16, 25, 36, 49, 64
  – 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144
  – 1, 4, 8, 16, 25, 36, 49, 64
• The fourth list differs from the second in a single entry (8 instead of 9), so it can be described relative to the second list

Connectivity Queries
• Beyond adjacency we'd like to answer
  – Transitive closure: is there a path from x to y?
  – Distance: what is the length of the shortest path from x to y?
• Applications
  – Link analysis
  – XML path queries with wildcards

Naïve Solutions
• Given the graph, precompute
  – Compute and store all-pairs shortest paths (APSP)
  – Answer any query in constant time
  – Space requirements?
• OR answer online
  – Given a query, compute single-source shortest paths (SSSP)
  – No additional space
  – Time to answer a query?
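The two naive solutions can be contrasted in a short sketch. It assumes an unweighted directed graph stored as a dictionary of out-neighbor lists, so breadth-first search serves as the single-source shortest-path routine; the function names and the six-node graph are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Single-source shortest path lengths (in hops) from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def online_distance(adj, x, y):
    """Online variant: no extra space, one graph traversal per query."""
    return bfs_distances(adj, x).get(y)   # None means y is unreachable from x


def precompute_all_pairs(adj):
    """Precomputed variant: one table entry per connected pair (self pairs included)."""
    nodes = set(adj) | {v for targets in adj.values() for v in targets}
    return {u: bfs_distances(adj, u) for u in nodes}


# Hypothetical graph: 1 -> 2 -> 4 <- 3, and 4 -> 5, 4 -> 6.
adj = {1: [2], 2: [4], 3: [4], 4: [5, 6]}

print(online_distance(adj, 1, 6))   # traverses the graph at query time
table = precompute_all_pairs(adj)
print(table[1].get(6))              # dictionary lookup at query time
print(sum(len(row) for row in table.values()), "stored entries")
```

The table answers queries with a lookup but can grow quadratically in the number of nodes, while the online variant stores nothing and pays a traversal per query; this is the gap a compact index tries to close.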
Encoding Problem
Find a compact representation for the transitive closure
• whose size is comparable to the data's size
• that supports connection tests (almost) as fast as the naive transitive closure lookup
• that can be built efficiently for large data sets

Main Idea: 2-Hop Covers and 2-Hop Labeling
• A 2-hop cover is a set of hops (x,y) such that every connected pair is covered by 2 hops
• For each node a, maintain two sets of labels (which are nodes): Lin(a) and Lout(a)
• For each connection (a,b),
  – choose a node c on the path from a to b (center node)
  – add c to Lout(a) and to Lin(b)
• Then (a,b) ∈ transitive closure T ⇔ Lout(a) ∩ Lin(b) ≠ ∅
[Figure: path from a through center node c to b]
Reachability and distance queries via 2-hop labels (Cohen et al., SODA 2002)

2-hop Covers
• Conjecture: 2-hop covers of size O(n√m) always exist
• Goal: minimize the sum of the label sizes
• Problem is NP-hard
  – ⇒ approximation required
• Theorem: there exists a polynomial-time algorithm that approximates the optimum within a factor of log n
• Greedy (set cover) algorithm

Approximation Algorithm
• What are good center nodes? Nodes that can cover many uncovered connections.
• Initial step: all connections are uncovered
• ⇒ Consider the center graph of candidates
[Figure: example graph on nodes 1–6 and the center graph of one candidate, with input side I and output side O]
• Initial density: |Edges| / (|I| + |O|) = 8 / (2 + 4) = 8/6 ≈ 1.33
  – We can cover 8 connections with 6 cover entries
• Density of the densest subgraph: here the same as the initial density

Approximation Algorithm
• What are good center nodes? Nodes that can cover many uncovered connections.
• Initial step: all connections are uncovered
• ⇒ Consider the center graph of candidates
[Figure: the same example graph and the center graph of a second candidate]
• Initial density: |Edges| / (|I| + |O|) = 12 / (4 + 3) = 12/7 ≈ 1.71
• Density of the densest subgraph = initial density (the center graph is complete)
• Cover the connections in the subgraph with the greatest density, using the corresponding center node

Approximation Algorithm
• What are good center nodes? Nodes that can cover many uncovered connections.
• Next step: some connections are already covered
• ⇒ Consider the center graph of candidates
[Figure: the example graph and a center graph restricted to the still-uncovered connections]
• Repeat this algorithm until all connections are covered
• Theorem: the generated cover is optimal up to a logarithmic factor

Experimental Results
Small example from the real world: a subset of DBLP
• 6,210 documents (publications)
• 168,991 elements
• 25,368 links (citations)
• 14 megabytes (uncompressed XML)
The element-level graph has 168,991 nodes and 188,149 edges
Its transitive closure: 344,992,370 connections (2,632.1 MB)

Experimental Results
For the example above:
• Transitive closure: 344,992,370 connections
• Two-hop cover: 1,289,930 entries
  – ⇒ compression factor of ~267
  – ⇒ queries are still fast (~7.6 entries/node)
• But: the computation took 45 hours and 80 GB of RAM!

Why Distances are Difficult
• Should be simple to add: annotate each label entry with a distance
  – Lout(v) = {u, …} and Lin(w) = {u, …} become
  – Lout(v) = {(u,2), …} and Lin(w) = {(u,4), …}
  – dist(v,w) = dist(v,u) + dist(u,w) = 2 + 4 = 6
[Figure: v reaches u in 2 steps, u reaches w in 4 steps]
• Is this correct?

Why Distances are Difficult
[Figure: v reaches u in 2 steps and u reaches w in 4 steps, but v also has a direct edge to w]
• dist(v,w) = 1
• ⇒ Center node u does not reflect the correct distance between v and w

Solution: Distance-aware Center Graph
• Add an edge to the center graph only if the corresponding connection is a shortest path
[Figure: example graph on nodes 1–6 and the distance-aware center graph of one candidate]
• Correct, but with problems:
  – Expensive to build the center graph (2 additional lookups per connection)
  – The approximation bound is no longer tight

Enhancements
• Allow approximate distances in exchange for more compact representations
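The labeling and the distance pitfall described above can be sketched in code. This is a simplified illustration, not the algorithm of Cohen et al.: it greedily picks the center that covers the most still-uncovered connections instead of computing densest subgraphs, and it materializes all distances first, which is only feasible for toy graphs. A center is used for a pair only when it lies on a shortest path (the distance-aware rule). All names and the four-node graph are hypothetical.

```python
from collections import deque


def bfs(adj, source):
    """Hop distances from source to every node it reaches."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def build_labels(adj):
    """Return (lout, lin): per node, a dict mapping center -> distance."""
    nodes = sorted(set(adj) | {v for targets in adj.values() for v in targets})
    dist = {u: bfs(adj, u) for u in nodes}
    uncovered = {(a, b) for a in nodes for b in dist[a] if a != b}
    lout = {u: {} for u in nodes}
    lin = {u: {} for u in nodes}

    def coverable(c):
        # Uncovered pairs (a, b) for which c lies on a shortest a-to-b path.
        return [(a, b) for (a, b) in uncovered
                if c in dist[a] and b in dist[c]
                and dist[a][c] + dist[c][b] == dist[a][b]]

    while uncovered:
        center = max(nodes, key=lambda c: len(coverable(c)))
        for a, b in coverable(center):
            lout[a][center] = dist[a][center]
            lin[b][center] = dist[center][b]
            uncovered.discard((a, b))
    return lout, lin


def reachable(lout, lin, a, b):
    """Connection test: do Lout(a) and Lin(b) share a center?"""
    return a == b or not lout[a].keys().isdisjoint(lin[b])


def distance(lout, lin, a, b):
    """Minimum over common centers of dist(a, c) + dist(c, b); None if unconnected."""
    if a == b:
        return 0
    common = lout[a].keys() & lin[b].keys()
    return min((lout[a][c] + lin[b][c] for c in common), default=None)


# Hypothetical graph with the pitfall: v reaches w via u, but also directly.
adj = {"v": ["a", "w"], "a": ["u"], "u": ["w"]}
lout, lin = build_labels(adj)
print(distance(lout, lin, "v", "w"), reachable(lout, lin, "w", "v"))
```

Because every stored entry is a true distance, the sum through any common center can never undershoot the real distance, and the shortest-path rule guarantees that at least one common center attains it. Dropping that rule would still give correct reachability answers but could report 3 instead of 1 for the pair (v, w).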
