CS 312: Graphs: BFS and DFS
Dan Sheldon
February 5, 2015

Plan
- Gale-Shapley running time
- Graphs
  - Motivation and definitions
  - Graph traversal: BFS and DFS

Running Time of Gale-Shapley?

  Initially all colleges and students are free
  while some college is free and hasn't made offers to every student do
    Choose such a college c
    Let s be the highest ranked student to whom c has not made an offer
    if s is free then
      c and s become engaged
    else if s is engaged to c' but prefers c to c' then
      c' becomes free
      c and s become engaged
    else
      c remains free
    end if
  end while

O(n^2) iterations. Are all statements inside the loop constant time?

Data Structures
Running time depends on implementation details and data structures (e.g. how to "choose such a college c").
- Q: How should we think about data structures when designing algorithms?
- A: Most of the time, as black boxes with running-time guarantees (e.g., "find an element in O(log n) time").
Good news: don't need to remember details of data structures
Bad news: they may seem opaque

Review: Lists and Arrays

                   Array                        List
  Get ith entry    O(1)                         O(i)
  Find element     O(n), O(log n) if sorted     O(n)
  Insert/delete    O(n)                         O(1)

Note: O(1) = constant number of steps

What Data Structures to Use For G-S?
Need to do the following in O(1) time
- Find free college c
- Find next student s in preference list of c
- Find current college c' of s
- Check if s likes c' better than c

What Data Structures to Use For G-S?
Input: preference lists = 2-D arrays, e.g.
CollegePref[c, i] = student in position i on c's list

  Operation                                      Data structure
  Find a free college c                          linked list: freeColleges
  Find next student s in preference list of c    array: i = Next[c], s = CollegePref[c, i]
  Find current college c' of s                   1-D array: Current[s]
  Check if s likes c' better than c              2-D array: Ranking[s, c]

Example on board

Another Example: Heapsort

  Input: unsorted array A[] of length n
  Let Q be a heap-based priority queue
  for i = 1 to n do
    Insert(Q, A[i])
  end for
  for i = 1 to n do
    A[i] = ExtractMin(Q)
  end for

Running time: (n × Insert) + (n × ExtractMin)
- O(n log n) if both operations are O(log n)

Graphs
- Motivation and Definitions
- Breadth-First Search (BFS) and Depth-First Search (DFS)

Undirected Graph
Undirected graph. G = (V, E)
- V = nodes (vertices)
- E = edges between pairs of nodes.
- Captures pairwise relationship between objects.
- Graph size parameters: n = |V|, m = |E|.
[Figure: example graph on nodes 1 to 5]
V = {1, 2, 3, 4, 5}
E = {(1,2), (1,4), (1,5), (2,3), (2,4), (3,5)}
n = 5
m = 6

Graphs
- Google Maps: what is the shortest driving route from South Hadley to Florida?
- Facebook: how many "degrees of separation" between me and Barack Obama?
- And many more...
[Figure: first page of the paper "Four Degrees of Separation" by Lars Backstrom, Paolo Boldi, Marco Rosa, Johan Ugander, and Sebastiano Vigna, arXiv:1111.4570v3 [cs.SI], 5 Jan 2012]

Definitions
Terminology. If e = (u, v) is an edge, then:
(1) u is a neighbor of v
(2) u is adjacent to v
(3) e is incident on u and v
(4) u and v are the endpoints of e
[Figure: example graph on nodes 1 to 5]
1 is adjacent to 2
(1,2) is incident on 1 and 2

Path
A path is a sequence P of nodes v_1, v_2, ..., v_{k-1}, v_k with the property that each consecutive pair v_i, v_{i+1} is joined by an edge in E.
[Figure: example graph on nodes 1 to 5]
1-4-2 is a path.
1-3-4 is NOT a path.

Distance
The distance from u to v is the minimum number of edges in any path from u to v.
[Figure: example graph on nodes 1 to 5]
distance(1,2) = 1
distance(1,3) = 2

Cycle
A cycle is a path v_1, v_2, ..., v_{k-1}, v_k in which v_1 = v_k, k > 2, and the first k-1 nodes are all distinct.
[Figure: example graph on nodes 1 to 5]
1-2-4-1 is a cycle.
1-2-4 is NOT a cycle.
1-2-4-1-5 is NOT a cycle.
1-2-4-1-5-3-2-1 is NOT a cycle.

Connectivity
An undirected graph is connected if for every pair of nodes u and v, there is a path between u and v.
[Figure: two example graphs on nodes 1 to 5; one is a connected graph, the other is NOT a connected graph]

(Upside-down) Trees
Parents, descendants, ancestors?

Trees
A tree is an undirected graph that is connected and does not contain a cycle.
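The path and cycle definitions from the preceding slides can be checked mechanically. The following is a minimal sketch, assuming the examples refer to the five-node graph from the "Undirected Graph" slide; the helper names build_adjacency, is_path, and is_cycle are illustrative and do not come from the lecture.

```python
# Example graph from the "Undirected Graph" slide (assumed to be the graph
# that the path and cycle examples refer to).
V = {1, 2, 3, 4, 5}
E = [(1, 2), (1, 4), (1, 5), (2, 3), (2, 4), (3, 5)]


def build_adjacency(nodes, edges):
    """Return a dict mapping each node to the set of its neighbors."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)  # undirected: record the edge in both directions
        adj[v].add(u)
    return adj


def is_path(adj, seq):
    """True if every consecutive pair of nodes in seq is joined by an edge."""
    return all(v in adj[u] for u, v in zip(seq, seq[1:]))


def is_cycle(adj, seq):
    """True if seq is a path with v_1 == v_k, k > 2, and the first k-1 nodes distinct."""
    k = len(seq)
    return (
        k > 2
        and is_path(adj, seq)
        and seq[0] == seq[-1]
        and len(set(seq[:-1])) == k - 1
    )


adj = build_adjacency(V, E)
print(len(V), len(E))                             # 5 6
print(is_path(adj, [1, 4, 2]))                    # True
print(is_path(adj, [1, 3, 4]))                    # False: no edge (1,3)
print(is_cycle(adj, [1, 2, 4, 1]))                # True
print(is_cycle(adj, [1, 2, 4, 1, 5, 3, 2, 1]))    # False: nodes 1 and 2 repeat
```

Storing neighbors in sets makes the "is (u, v) an edge?" test a constant-time lookup on average, in the spirit of the black-box data structure guarantees discussed earlier.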
[Figure: two example graphs on nodes 1 to 5; one is a tree, the other is NOT a tree]
[Image source for "(Upside-down) Trees": http://www.offbeattravel.com/MoCA.html]

Review Definitions
What to know: n, m, neighbor, incident, path, distance, cycle, connected, tree
[Figure: example graph on nodes 1 to 5]
Example on board

Graph Traversal
Is a graph connected?
[Figure: two example graphs, one labeled "easy" and one labeled "hmmm..."]

Graph Traversal
Is a graph connected?
Approach: explore outward from arbitrary starting node s to find all nodes reachable from s (connected component)

Is a Graph Connected?
Algorithm 1: Breadth-first search (BFS)
Explore outward by distance
- Start at a
- Visit all nodes at distance 1 from a
- Visit all nodes at distance 2 from a
[Figure: three copies of a graph on nodes a, b, c, d, e, one per step]

Breadth-First Search
Layers
- L_0 = {s}
- L_1 = all neighbors of L_0
- L_2 = all nodes with an edge to L_1 that don't belong to L_0 or L_1
- ...
- L_{i+1} = nodes with an edge to L_i that don't belong to an earlier layer:
  $L_{i+1} = \{v : \exists (u, v) \in E,\ u \in L_i,\ v \notin (L_0 \cup \ldots \cup L_i)\}$
Observation: L_i consists of all nodes at distance exactly i from s. There is a path from s to t if and only if t appears in some layer.

BFS Tree
If we keep only the edges traversed while doing a breadth-first search, we will have a tree.
Example on board

BFS Tree
Property. Let T be a BFS tree of G = (V, E), and let (x, y) be an edge of G. Then the layers of x and y differ by at most 1.
[Figure: graph on nodes a, b, c, d, e]
Layer 0: {a}
Layer 1: {b, c, d}
Layer 2: {e}
Proof on board

A More General Strategy
To explore the connected component, add any node v for which
- (u, v) is an edge
- u is explored, but v is not
Picture on board

Is a Graph Connected?
Algorithm 2: Depth-first search (DFS)
- Keep exploring from most recently added node until you have to backtrack
[Figure: successive snapshots of DFS on a graph with nodes a, b, c, d, e]

DFS Algorithm

  DFS(u)
    Mark u as "Explored"
    for each edge (u, v) incident to u do
      if v is not marked "Explored" then
        Recursively invoke DFS(v)
      end if
    end for

Depth First Search
Theorem: Let T be a depth-first search tree. Let x and y be 2 nodes in the tree. Let (x, y) be an edge that is in G but not in T. Then either x is an ancestor of y or y is an ancestor of x in T.
[Figure: graph on nodes a, b, c, d, e and its DFS tree]
Proof?

Summary
- Definitions
  - G = (V, E), n = |V|, m = |E|
  - neighbor, incident, cycle, path, connected
- BFS and DFS
  - Two ways to traverse a graph, each produces a tree
  - BFS tree: shallow and wide ("bushy")
  - DFS tree: deep and narrow ("scraggly")
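The two traversals summarized above can be sketched in a few lines of Python. This is an illustrative sketch, not the lecture's own code: it assumes the five-node graph from the "Undirected Graph" slide as input, records each tree as a parent map, and uses hypothetical function names (build_adjacency, bfs_layers, dfs_tree, is_connected).

```python
# Edge list of the five-node example graph from the "Undirected Graph" slide.
EDGES = [(1, 2), (1, 4), (1, 5), (2, 3), (2, 4), (3, 5)]


def build_adjacency(edges):
    """Adjacency lists; neighbors appear in the order the edges are listed."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def bfs_layers(adj, s):
    """Return (layers, parent); layers[i] holds the nodes at distance i from s."""
    parent = {s: None}          # doubles as the set of discovered nodes
    layers = [[s]]
    while True:
        next_layer = []
        for u in layers[-1]:
            for v in adj[u]:
                if v not in parent:      # v belongs to no earlier layer
                    parent[v] = u        # (u, v) becomes a BFS tree edge
                    next_layer.append(v)
        if not next_layer:
            return layers, parent
        layers.append(next_layer)


def dfs_tree(adj, s):
    """Recursive DFS as in the DFS(u) pseudocode; returns the DFS tree's parent map."""
    parent = {s: None}          # a node is "Explored" once it has an entry here

    def dfs(u):
        for v in adj[u]:
            if v not in parent:
                parent[v] = u            # (u, v) becomes a DFS tree edge
                dfs(v)

    dfs(s)                      # note: very deep graphs may hit Python's recursion limit
    return parent


def is_connected(adj):
    """A graph is connected iff a traversal from any start node reaches every node.
    Assumes every node appears in at least one edge, so it is a key of adj."""
    start = next(iter(adj))
    return len(bfs_layers(adj, start)[1]) == len(adj)


adj = build_adjacency(EDGES)
layers, bfs_parent = bfs_layers(adj, 1)
dfs_parent = dfs_tree(adj, 1)
print(layers)             # [[1], [2, 4, 5], [3]]
print(dfs_parent)         # {1: None, 2: 1, 3: 2, 5: 3, 4: 2}
print(is_connected(adj))  # True
```

On this input the BFS tree has depth 2 with three children under the root, while the DFS tree contains the chain 1-2-3-5, matching the "bushy" versus "scraggly" contrast. The non-tree edges also behave as the two properties predict: (2,4) and (3,5) join nodes whose BFS layers differ by at most 1, and (1,4) and (1,5) join an ancestor to a descendant in the DFS tree.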