6.207/14.15: Networks
Lecture 7: Search on Networks: Navigation and Web Search
Daron Acemoglu and Asu Ozdaglar
MIT
September 30, 2009
Outline
Navigation (or decentralized search) in networks
Web search
Hubs and authorities: HITS algorithm
PageRank algorithm
Reading:
EK, Chapter 20.3-20.6 (navigation)
EK, Chapter 14 (web search)
(Optional reading:) “The Small-World Phenomenon: An Algorithmic
Perspective,” by Jon Kleinberg, in Proc. of ACM Symposium on
Theory of Computing, pp. 163-170, 2000.
Introduction
We have studied random graph models of network structure.
Static random graph models (Erdős–Rényi model, configuration model,
small-world model)
Dynamic random graph models (preferential attachment model)
In the next two lectures, we will study processes taking place on networks.
In particular, we will focus on:
Navigation or decentralized search in networks
Web search: ranking web pages
Spread of epidemics
Decentralized Search
Recall Milgram’s small-world experiment, where the goal was to find short
chains of acquaintances (short paths) linking arbitrary pairs of people in the
US.
A source person in Nebraska is asked to deliver a letter to a target
person in Massachusetts.
This will be done through a chain where each person forwards the
letter to someone he knows on a first-name basis.
Over many trials, the average number of intermediate steps in
successful chains was found to lie between 5 and 6, leading to the
"six degrees of separation" principle.
Milgram's experiment produced two fundamentally surprising discoveries.
The first is that such short paths exist in networks of acquaintances.
The small-world model proposed by Watts and Strogatz (WS) was
aimed at capturing two fundamental properties of networks: short
paths and high clustering.
Decentralized Search
Second, people are able to find the short paths to the designated target with
only local information about the network.
If everybody knows the global network structure or if we can "flood the
network" (i.e., everyone sends the letter to all of their friends), we
would be able to find the short paths efficiently.
With local information, even if the social network has short paths, it is
not clear that such decentralized search will be able to find them
efficiently.
Image by MIT OpenCourseWare. Adapted from Easley, David, and Jon Kleinberg. Networks, Crowds, and Markets:
Reasoning about a Highly Connected World. New York, NY: Cambridge University Press, 2010. ISBN: 9780521195331.
Figure: In myopic search, the current message-holder chooses the contact
that is closest to the target and forwards the message to it.
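Here is a minimal, self-contained Python sketch of the myopic rule; the grid representation, helper names, and toy usage are illustrative, not from the lecture:

```python
# Myopic (greedy) forwarding on a grid: the current holder passes the message
# to whichever of its contacts is closest to the target.

def grid_distance(v, w):
    """Number of grid steps between lattice points v and w."""
    return abs(v[0] - w[0]) + abs(v[1] - w[1])

def myopic_search(source, target, contacts, max_steps=10_000):
    """Forward greedily until the target is reached; returns the path taken."""
    path = [source]
    current = source
    for _ in range(max_steps):
        if current == target:
            return path
        # Choose the contact closest to the target (ties broken arbitrarily).
        current = min(contacts(current), key=lambda w: grid_distance(w, target))
        path.append(current)
    return path  # gave up (should not happen while local contacts exist)

# Toy usage: only the four local neighbors, on an unbounded grid.
local = lambda v: [(v[0]+1, v[1]), (v[0]-1, v[1]), (v[0], v[1]+1), (v[0], v[1]-1)]
print(len(myopic_search((0, 0), (3, 2), local)) - 1)  # -> 5 steps
```

In a real instance, `contacts` would return the four local neighbors plus the node's long-range contact.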
A Model for Decentralized Search—1
Kleinberg introduces a simple framework that encapsulates the paradigm of
WS – rich in local connections, with a few long-range links.
The starting point is an n × n two-dimensional grid with directed edges
(instead of an undirected ring).
The nodes are identified with the lattice points, i.e., a node v is identified
with the lattice point (i, j ) with i, j ∈ {1, . . . , n }.
For any two nodes v and w , we define the distance between them d (v , w ) as
the number of grid steps between them, d ((i, j ), (k, l )) = |k − i | + |l − j |.
Each node is connected directly to its 4 local neighbors – its local contacts.
Each node also has a random edge to another node – its long-range contact.
Image by MIT OpenCourseWare. Adapted from
Easley, David, and Jon Kleinberg. Networks,
Crowds, and Markets: Reasoning about a Highly
Connected World. New York, NY: Cambridge
University Press, 2010. ISBN: 9780521195331.
Figure: A grid network with n = 6 and the contacts of a node u.
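A short sketch of how such a long-range edge could be drawn in Python (direct enumeration over the grid, fine for small n but not an efficient implementation; the function names are illustrative):

```python
# Sample one long-range contact with Pr[w] proportional to d(v, w)^(-alpha),
# as in Kleinberg's model.
import random

def grid_distance(v, w):
    return abs(v[0] - w[0]) + abs(v[1] - w[1])

def long_range_contact(v, n, alpha):
    """Sample the long-range contact of node v on the n-by-n grid."""
    nodes = [(i, j) for i in range(1, n + 1) for j in range(1, n + 1)
             if (i, j) != v]
    # Unnormalized weights d(v, w)^(-alpha); alpha = 0 gives the uniform case.
    weights = [grid_distance(v, w) ** (-alpha) for w in nodes]
    return random.choices(nodes, weights=weights, k=1)[0]

# Toy usage: one draw on a 6 x 6 grid with clustering exponent alpha = 2.
print(long_range_contact((3, 3), 6, alpha=2.0))
```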
A Model for Decentralized Search—2
The model has a parameter that controls the “scales spanned by the
long-range weak ties.”
The random edge is generated in a way that decays with distance, controlled
by a clustering exponent α: in generating a random edge out of v, we have
this edge link to w with probability proportional to d(v, w)^{−α}.
When α = 0, we have the uniform distribution over long-range contacts
– the distribution used in the model of WS.
As α increases, the long-range contact of a node becomes more
clustered in its vicinity on the grid.
Image by MIT OpenCourseWare. Adapted from
Easley, David, and Jon Kleinberg. Networks,
Crowds, and Markets: Reasoning about a Highly
Connected World. New York, NY: Cambridge
University Press, 2010. ISBN: 9780521195331.
Figure: (left) Small clustering exponent, (right) large clustering exponent
Decentralized Search in this Model
We evaluate different search procedures according to their expected delivery
time: the expected number of steps required to reach the target (where the
expectation is over a randomly generated set of long-range contacts and
randomly chosen source and target nodes).
Given this set-up, we will prove that decentralized search in the WS model
necessarily requires a large number of steps to reach the target (much larger
than the true length of the shortest path).
As a model, the WS network is effective in capturing clustering and the
existence of short paths, but not the ability of people to actually find
those paths.
The problem here is that the weak ties that make the world small are
“too random” in this model.
The parameter α controls how uniform versus how clustered the long-range
links are, and thus captures the tradeoff between the two.
Question: Is there an optimal α (or network structure) that allows for rapid
decentralized search?
Efficiency of Decentralized Search
Theorem (Kleinberg 2000)
Assume that each node only knows its set of local contacts, the location of its
long-range contact, and the location of the target (crucially, it does not know
the long-range contacts of the subsequent nodes). Then:
(a) For 0 ≤ α < 2, the expected delivery time of any decentralized algorithm is
at least β_α n^{(2−α)/3} for some constant β_α.
(b) For α = 2, there is a decentralized algorithm so that the expected delivery
time is at most β_2 (log(n))^2 for some constant β_2.
(c) For 2 < α < 3, the expected delivery time of any decentralized algorithm is
at least β_α n^{α−2} for some constant β_α.
Decentralized search with α = 2 requires time that grows like a polynomial
in log(n ), while any other exponent requires time that grows like a
polynomial in n – exponentially worse.
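As a rough numeric illustration of the gap (suppressing the constants β_α), take a grid of side n = 10^6:

```latex
\[
\alpha = 0:\; n^{2/3} = (10^6)^{2/3} = 10^4 \ \text{steps (lower bound)},
\qquad
\alpha = 2:\; (\log n)^2 = (\ln 10^6)^2 \approx 191 \ \text{steps (upper bound)}.
\]
```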
Proof Idea for α = 0
The basic idea: we randomly pick a target node t in the grid and consider the
(square) box around t of side length n^{2/3}.
With high probability, the source node lies outside the box.
Because long-range contacts are created uniformly at random (since α = 0),
the probability that any one node has a long-range contact inside the box is
given by the size of the box (n^{4/3} nodes) divided by the total number of
nodes n^2, i.e., n^{4/3}/n^2 = n^{−2/3}.
Therefore, any decentralized search algorithm will need at least n^{2/3} steps in
expectation to find a node with a long-range contact in the box.
Moreover, as long as it does not find a long-range link leading into the box, it
cannot reach the target in fewer than n^{2/3} steps (using only local contacts).
This shows that any decentralized search algorithm must take at least n^{2/3}
steps in expectation to reach the target node.
Proof Idea for 0 ≤ α < 2
We use a similar construction for this range: we consider a (square) box
around the target with side length n^γ, where γ ≡ (2 − α)/3.
Recall that a node v forms a long-range link with node w with probability
proportional to d(v, w)^{−α}. Here, the constant of proportionality is 1/Z
with Z = ∑_w d(v, w)^{−α}.
Note that in the 2-dimensional grid, a node has ≈ d neighbors at distance d
from itself. This implies that

Z = ∑_{d=1}^{n} d · d^{−α} ≈ n^{2−α} = n^{3γ},

where the last relation follows using an integral approximation.
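For reference, the integral approximation behind the last relation is the standard estimate (constants depending on α are suppressed):

```latex
\[
Z \;=\; \sum_{d=1}^{n} d \cdot d^{-\alpha}
\;\approx\; \int_{1}^{n} x^{\,1-\alpha}\,dx
\;=\; \frac{n^{2-\alpha}-1}{2-\alpha}
\;\sim\; n^{2-\alpha} \;=\; n^{3\gamma}
\qquad (0 \le \alpha < 2).
\]
```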
Hence, the probability that any one node has a long-range contact inside the
box is at most n^{2γ}/n^{3γ} = n^{−γ} (since there are n^{2γ} nodes inside the box).
This shows that any decentralized search algorithm will need at least
n^γ = n^{(2−α)/3} steps in expectation to find a node with a long-range contact in
the box.
Proof Idea for α = 2
We again first compute the proportionality constant Z:

Z = ∑_{d=1}^{n} d · d^{−2} = ∑_{d=1}^{n} 1/d ≈ log(n),

where the last relation follows by an integral approximation.
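Concretely, this is the familiar harmonic-sum estimate:

```latex
\[
\sum_{d=1}^{n} \frac{1}{d} \;\approx\; \int_{1}^{n} \frac{dx}{x} \;=\; \ln(n).
\]
```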
Instead of picking an “impenetrable box” around the target, we show that it
is easy to enter smaller and smaller sets centered around the target node.
We consider a node at a distance 2s from the target and compute the time
it takes to halve this distance (i.e., get into a box of size s).
Proof Idea for α = 2 (Continued)
The probability of having a long-range contact in the box of side s is
≥ s^2 · (1/log(n)) · 1/(2s)^2 ≈ 1/log(n).
Hence, it takes in expectation log(n) time steps to halve the distance. Since
we can halve the distance at most log(n) times, this leads to an algorithm
that takes (log(n))^2 steps to reach the target in expectation.
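Putting the two observations together (with c the constant hidden in the ≈, and at most log_2(n) halvings needed to go from distance ≈ n down to distance 1):

```latex
\[
\mathbb{E}[\text{delivery time}]
\;\lesssim\; \log_2(n) \cdot c\,\log(n)
\;=\; O\!\big((\log n)^2\big).
\]
```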
Proof Idea for 2 < α < 3
Computing the proportionality constant Z in this case yields Z = constant
(independent of n) for large n.
The expected length of a typical long-range contact is given by

E[length of a typical long-range contact] = (1/Z) ∑_{d=1}^{n} d · d · d^{−α} ≈ n^{3−α},

where the extra factor of d counts the ≈ d nodes at distance d.
Hence, the expected time is at least n/n^{3−α} = n^{α−2}.
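The last relation again follows from the integral estimate below; since each step moves the message toward the target by at most ≈ n^{3−α} in expectation while the initial distance is of order n, at least n/n^{3−α} = n^{α−2} steps are needed:

```latex
\[
\frac{1}{Z}\sum_{d=1}^{n} d \cdot d \cdot d^{-\alpha}
\;\approx\; \int_{1}^{n} x^{\,2-\alpha}\,dx
\;=\; \frac{n^{3-\alpha}-1}{3-\alpha}
\;\sim\; n^{3-\alpha}
\qquad (2 < \alpha < 3).
\]
```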
Web Search – Link Analysis
When you go to Google and type "MIT", the first result it displays is
web.mit.edu, MIT's home page.
How does Google know that this is the best answer?
This is a problem of information retrieval: since the 1960s, automated
information retrieval systems have been designed to search data repositories in
response to keyword queries.
The classical approach has been based on "textual analysis", i.e., looking at
each page separately, without regard to the link structure.
In the example of the MIT home page, what makes it stand out is the number
of links that "point to it", which can be used to assess the authority of a
page on the topic (a toy sketch of this counting step follows):
First, collect a large sample of pages that are relevant to the query, as
determined by a classical, text-only information retrieval approach.
Then pick the web page that receives the greatest number of in-links (its score)
from these pages.
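A minimal sketch of the in-link counting heuristic in Python; the page names are made up for illustration, and this is of course not Google's actual pipeline:

```python
# `pages` maps each page in the query-relevant sample to the pages it links to.
from collections import Counter

pages = {
    "web.mit.edu": [],
    "deptA.mit.edu": ["web.mit.edu"],
    "deptB.mit.edu": ["web.mit.edu", "deptA.mit.edu"],
    "blog.example.com": ["web.mit.edu"],
}

# Count in-links: each outgoing link is an endorsement of its target.
in_links = Counter(target for outlinks in pages.values() for target in outlinks)

# The page with the most in-links is declared the best answer.
best = max(pages, key=lambda p: in_links.get(p, 0))
print(best, in_links)   # -> web.mit.edu
```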
Link Structure–1
For a different query, say "newspapers", the "most important page" is less
obvious: you get a high score for a mix of prominent newspapers, as well as
for pages that are going to receive many links no matter what the query is
(such as Yahoo, Facebook, Amazon).
Image by MIT OpenCourseWare. Adapted from Easley, David, and Jon Kleinberg. Networks, Crowds, and Markets:
Reasoning about a Highly Connected World. New York, NY: Cambridge University Press, 2010. ISBN: 9780521195331.
Figure: Counting in-links to pages for the query “newspapers”.
Hubs and Authorities–1
This suggests a ranking procedure which is defined in terms of two kinds of
nodes:
Authorities: the prominent, highly endorsed answers to the query
(nodes that are pointed to by high-value hubs)
Hubs: high-value lists (nodes that point to many good authorities)
For each page p, we are trying to estimate its value as a potential authority
and as a potential hub, so we assign it two numerical values: b(p) (its
authority weight) and h(p) (its hub weight).
Let A denote the n × n adjacency matrix, i.e., A_ij = 1 if there is a link from
node i to node j.
The authority and hub weights then satisfy:

h(i) = ∑_j A_ij b(j)   for all i,
b(j) = ∑_i A_ij h(i)   for all j.
Hubs and Authorities–2
This can be written in matrix-vector notation as:

h = A b,   b^T = h^T A   or, equivalently,   b = A^T h,

where b^T (or A^T) denotes the transpose of vector b (or matrix A).
This yields an iterative procedure in which at each time k ≥ 0, the authority
and hub weights are updated according to:

b_{k+1} = (A^T A) b_k,   h_{k+1} = (A A^T) h_k.
Using an eigenvector decomposition, one can show that this iteration
converges to a limit point related to the eigenvectors of the matrices A^T A
and A A^T.
Implementation of this algorithm requires “global knowledge,” therefore it is
implemented in a “query-dependent manner”.
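A minimal power-iteration sketch of this update in NumPy; the small adjacency matrix is made up, and the per-round normalization (not spelled out above) keeps the weights bounded so they converge in direction:

```python
import numpy as np

A = np.array([[0, 1, 1, 0],    # A[i, j] = 1 if page i links to page j
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)

n = A.shape[0]
h = np.ones(n)                  # hub weights
b = np.ones(n)                  # authority weights

for _ in range(100):
    b = A.T @ h                 # authority: sum of hub weights pointing to you
    h = A @ b                   # hub: sum of authority weights you point to
    b /= np.linalg.norm(b)      # normalize to keep the iteration bounded
    h /= np.linalg.norm(h)

print("authorities:", b.round(3))
print("hubs:       ", h.round(3))
```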
PageRank–1
The multi-billion-dollar, query-independent idea behind Google.
Each node (or page) is important if it is cited by other important pages.
Each node j has a single weight (its PageRank value) w(j), which is a function
of the weights of its incoming neighbors:
w(j) = ∑_i (w(i)/d_out(i)) A_ij,

where A is the adjacency matrix and d_out(i) is the out-degree of node i
(dividing by d_out(i) dilutes the importance of node i if it links to many nodes).
We can express this in vector-matrix notation as:

w^T = w^T P,

where P_ij = A_ij/d_out(i) (note that ∑_j P_ij = 1).
PageRank–2
This defines a random walk on the nodes of the network:
A walker chooses a starting node uniformly at random. Then, in each
step, the walker follows an outgoing link selected uniformly at random
from its current node, and it moves to the node that this link points to.
The PageRank of a node i is the limiting probability that a random walk on
the network is at i as the number of steps grows large.
Difficulty with PageRank: dangling ends (pages or groups of pages with no
links back out to the rest of the network), which may cause the random walk
to get trapped.
To fix this, we allow the random walk, at each step, to jump with probability
(1 − s) to a node chosen uniformly at random (with probability s, as before,
it follows a random outgoing link).
This yields the following iteration for the PageRank algorithm: at step k,

w_{k+1}^T = s w_k^T P + ((1 − s)/n) e^T,

where e^T = [1, . . . , 1]. This is the version of PageRank used in practice (with
many other tricks), with s usually between 0.8 and 0.9.
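A minimal NumPy sketch of this iteration; the tiny link matrix is made up for illustration, and s = 0.85 is within the range quoted above:

```python
import numpy as np

A = np.array([[0, 1, 1, 0],    # A[i, j] = 1 if page i links to page j
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

n = A.shape[0]
d_out = A.sum(axis=1)           # out-degrees (all positive in this toy example)
P = A / d_out[:, None]          # P[i, j] = A[i, j] / d_out(i), row-stochastic

s = 0.85                        # follow-a-link probability
w = np.full(n, 1.0 / n)         # start from the uniform distribution

for _ in range(100):
    # w^T <- s * w^T P + ((1 - s)/n) e^T
    w = s * (w @ P) + (1.0 - s) / n

print("PageRank:", w.round(3))  # sums to 1 (a probability distribution)
```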
MIT OpenCourseWare
http://ocw.mit.edu
14.15J / 6.207J Networks
Fall 2009
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.