Algorithms for Data Science
CSOR W4246

Eleni Drinea
Computer Science Department
Columbia University

Thursday, September 24, 2015

Outline

1. Recap
2. Applications of DFS
   - Cycle detection
   - Topological sorting
   - Strongly connected components

Review of the last lecture

1. Applications of BFS
   - Connected components in undirected graphs
   - Testing bipartiteness
2. DFS
   - Classification of graph edges in directed graphs: back, forward, cross
   - Time intervals of vertices; identifying the type of an edge from the time intervals of its endpoints

Finding your way in a maze

Depth-first search (DFS): starting from a vertex s, explore the graph as deeply as possible, then backtrack.

1. Try the first edge out of s, towards some node v.
2. Continue from v until you reach a dead end, that is, a node whose neighbors have all been explored.
3. Backtrack until you reach a node with an unexplored neighbor, then repeat step 2.

Remark: DFS answers s-t connectivity.

Directed graphs: classification of edges

Graph edges that do not belong to the DFS tree(s) may be

1. forward: from a vertex to a descendant (other than a child)
2. back: from a vertex to an ancestor
3. cross: from right to left (no ancestral relation), that is,
   - from tree to tree, or
   - between nodes in the same tree but on different branches

On the time intervals of vertices u, v

If we use an explicit stack, then

- start(u) is the time when u is pushed onto the stack;
- finish(u) is the time when u is popped from the stack (that is, all of its neighbors have been explored).

Intervals [start(u), finish(u)] and [start(v), finish(v)] either

- are nested, one containing the other (u is an ancestor of v or vice versa); or
- are disjoint.

Classifying edges using time

1. Edge (u, v) ∈ E is a back edge in a DFS tree if and only if start(v) < start(u) < finish(u) < finish(v).
2. Edge (u, v) ∈ E is a forward edge if start(u) < start(v) < finish(v) < finish(u) (tree edges satisfy the same inequalities).
3. Edge (u, v) ∈ E is a cross edge if start(v) < finish(v) < start(u) < finish(u).

Application I: Cycle detection

Claim 1. G = (V, E) has a cycle if and only if DFS(G) yields a back edge.

Proof. If (u, v) is a back edge, then together with the path on the DFS tree from v to u it forms a cycle.

Conversely, suppose G has a cycle. Let v be the first vertex of the cycle discovered by DFS(G), and let (u, v) be the edge of the cycle that enters v. Since there is a path from v to every vertex of the cycle, all vertices of the cycle are discovered and fully explored before v is popped from the stack. Hence the interval of u is contained in the interval of v. By the interval characterization of back edges (item 1 of "Classifying edges using time"), (u, v) is a back edge.

Application II: Topological sorting in DAGs

- An undirected acyclic graph has an extremely simple structure: it is a forest (a tree if it is connected), hence a sparse graph (at most n - 1 edges, so O(n) edges).
- A directed acyclic graph (DAG) may be dense (Ω(n^2) edges): e.g., V = {1, ..., n}, E = {(i, j) : i < j}.

[Figure: an example DAG on vertices 1, 2, 3, 4.]

Topological sorting: motivation

Input:

- a set of tasks {1, 2, ..., n} that need to be performed
- a set of dependencies, each of the form (i, j), indicating that task i must be performed before task j

Output: a valid order in which the tasks may be performed, so that all dependencies are respected.

Example: tasks are courses and certain courses must be taken before others.

How can we model this problem using a graph? What kind of graph must arise and why?

Topological ordering: definition

Definition 1. A topological ordering of G is an ordering of its nodes as 1, 2, ..., n such that for every edge (i, j), we have i < j.

- All edges point forward in the topological ordering.
- It provides an order in which all tasks can be safely performed: when we try to perform task j, all tasks required to precede it have already been done.
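Claim 1 (G has a cycle if and only if DFS yields a back edge) can be turned into a short check. The sketch below is only an illustration, not code from the lecture: it assumes the graph is given as a Python dict mapping each vertex to a list of out-neighbors, and the name `find_back_edge` is hypothetical. It uses the fact that an edge (u, v) is a back edge exactly when v has been started but not yet finished at the moment the edge is examined.

```python
def find_back_edge(graph):
    """Return some back edge (u, v) found by DFS, or None if there is none.

    graph: dict mapping each vertex to a list of out-neighbors
    (an assumed input format, chosen for this sketch).
    """
    start, finish = {}, {}
    clock = 0

    def visit(u):
        nonlocal clock
        clock += 1
        start[u] = clock
        for v in graph.get(u, []):
            if v not in start:
                # (u, v) is a tree edge: explore v recursively.
                found = visit(v)
                if found is not None:
                    return found
            elif v not in finish:
                # v was started but is not finished, so v is an
                # ancestor of u (or u itself): (u, v) is a back edge.
                return (u, v)
            # Otherwise v is finished: forward or cross edge, no cycle here.
        clock += 1
        finish[u] = clock
        return None

    for s in graph:
        if s not in start:
            found = visit(s)
            if found is not None:
                return found
    return None


dense_dag = {1: [2, 3, 4], 2: [3, 4], 3: [4], 4: []}
triangle = {1: [2], 2: [3], 3: [1]}
print(find_back_edge(dense_dag))  # None
print(find_back_edge(triangle))   # (3, 1)
```

The recursion mirrors the explicit stack of the slides; for very deep graphs an explicit stack would avoid Python's recursion limit.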

Example of DAG and its topological sorting

[Figure: a DAG on vertices 1 to 7 (top left), its topological sort (top right), and a drawing emphasizing the topological sort (bottom).]

Topological sorting in DAGs

Claim 2. If G has a topological ordering, then G is a DAG.

Proof: By contradiction (exercise).

A visualization of the proof is provided by the linearized graph of the previous slide: vertices appear in increasing order and edges go from left to right, hence there are no cycles.

Is the converse true: does every DAG have a topological ordering? And how can we find it?

Structural properties of DAGs

In a DAG, can every vertex have

- an outgoing edge?
- an incoming edge?

Definition 2 (source and sink). A source is a node with no incoming edges. A sink is a node with no outgoing edges.

Fact 3. Every DAG has at least one source and at least one sink.

How can we use Fact 3 to find a topological order?

The node that we label first in the topological sorting must have no incoming edges. Fact 3 guarantees that such a node exists.

Fact 4. Let G' be the graph after a source node and its adjacent edges have been removed. Then G' is a DAG.

Proof: removing a vertex and edges from G cannot create a cycle.

This gives rise to a recursive algorithm for finding the topological order of a DAG. Its correctness can be shown by induction (use Facts 3 and 4 to show the induction step).

Algorithm for topological sorting

TopologicalOrder(G)

1. Find a source vertex s and order it first.
2. Delete s and its adjacent edges from G; let G' be the new graph.
3. TopologicalOrder(G')
4. Append the order found for G' after s.

Running time: O(n^2). Can be improved to O(n + m).

Topological sorting via DFS

Let G = (V, E) be a DAG.

- Run DFS(G); compute finish times.
- Process the tasks in decreasing order of finish times.

Running time: O(m + n)

Intuition behind this algorithm

- The task v with the largest finish has no incoming edges (if it had an incoming edge from some other task u, then u would have a larger finish than v).
  Hence v does not depend on any other task and it is safe to perform it first.
- The same reasoning shows that the task w with the second largest finish has no incoming edges from any other task except (maybe) task v. Hence it is safe to perform w second.
- And so on.

Formal proof of correctness

By Claim 1 there are no back edges in the DFS forest of a DAG. Thus every edge (u, v) ∈ E is either

1. a forward or tree edge: start(u) < start(v) < finish(v) < finish(u); or
2. a cross edge: finish(v) < start(u) < finish(u).

Proof of correctness (cont'd)

Hence for every (u, v) ∈ E, finish(v) < finish(u).

Consider a task v. All tasks u upon which v depends, that is, all tasks u such that there is an edge (u, v) ∈ E, satisfy finish(v) < finish(u). Since we process tasks in decreasing order of finish times, all tasks u upon which v depends have been processed before we start processing v.

Exploring the connectivity of a graph

- Undirected graphs: find all connected components
- Directed graphs: find all strongly connected components (SCCs)
- SCC(u) = set of nodes that are reachable from u and have a path back to u
- SCCs provide a hierarchical view of the connectivity of the graph:
  - on a top level, the meta-graph of SCCs has a useful and simple structure (coming up);
  - each meta-vertex of this graph is a strongly connected subgraph that we can further explore.

How can we find SCC(u) using BFS?

1. Run BFS(u) in G; the resulting tree T consists of the set of nodes to which there is a path from u.
2. Define G^r as the reverse graph, where edge (i, j) becomes edge (j, i).
3. Run BFS(u) in G^r; the resulting BFS tree T' consists of the set of nodes that have a path to u.
4. The vertices common to T and T' compose the strongly connected component of u.

What if we want all the SCCs of the graph?

The meta-graph of SCCs of a directed graph

[Figure: a directed graph on vertices 1 to 7.]

Consider the meta-graph of all SCCs of G.

- Make a (super)vertex for every SCC.
- Add a (super)edge from SCC C_i to SCC C_j if there is an edge from some vertex u of C_i to some vertex v of C_j.

What kind of graph is the meta-graph of SCCs?

[Figure: the same graph with its SCCs C_1, C_2, C_3 marked.]

This graph is a DAG: if the meta-graph had a cycle, all SCCs on that cycle would be mutually reachable and would form a single SCC.

Is there an SCC we could process first?

[Figure: the directed graph on vertices 1 to 7.]

Suppose we had a sink SCC of G, that is, an SCC with no outgoing edges.

1. What will DFS discover starting at a node of a sink SCC?
2. How do we find a node that for sure lies in a sink SCC?
3. How do we continue to find all other SCCs?

Easier to find a node in a source SCC!

Fact 5. The node assigned the largest finish time when we run DFS(G) belongs to a source SCC in G.

Example: v5 belongs to source SCC C_2.

Proof. We will use Lemma 6 below.

Let G be a directed graph. The meta-graph of its SCCs is a DAG. For an SCC C, let

finish(C) = max_{v ∈ C} finish(v)

Example: finish(C_1) = finish(v1) = 8.

Lemma 6. Let C_i, C_j be SCCs in G. Suppose there is an edge (u, v) ∈ E such that u ∈ C_i and v ∈ C_j. Then finish(C_i) > finish(C_j).

By Lemma 6, the SCC containing the node with the largest finish cannot have an incoming (super)edge, so it is a source SCC.

G^r is useful again

- Fact 5 provides a direct way to find a node in a source SCC of G: pick the node with the largest finish.
- But we want a node in a sink SCC of G!
- Consider G^r, the graph where the edges of G are reversed. How do the SCCs of G and G^r compare?
- Run DFS on G^r: the node with the largest finish comes from a source SCC of G^r (Fact 5). This is a sink SCC of G!

Using this observation to find all SCCs

We now know how to find a sink SCC in G.

1. Run DFS(G^r); compute finish times.
2. Run DFS(G) starting from the node with the largest finish: the nodes in the resulting tree T form a sink SCC in G.

How do we find all remaining SCCs?

- Remove T from G; let G' be the resulting graph.
- The meta-graph of SCCs of G' is a DAG, hence it has at least one sink SCC.
- Apply the procedure above recursively on G'.

Algorithm for finding SCCs in directed graphs

SCC(G = (V, E))

1. Compute G^r, the reverse of G.
2. Run DFS(G^r); compute finish(u) for all u.
3. Run DFS(G), considering start vertices in decreasing order of finish(u).
4. Output the vertices of each tree in the DFS forest of step 3 as an SCC.

Remark 1.

1. Running time: O(n + m). Why?
2. Equivalently, we can (i) run DFS(G) and compute finish times; (ii) run DFS(G^r), considering vertices by decreasing order of finish. Why?

A directed graph and its DFS forest with time intervals

[Figure: a directed graph on vertices 1 to 7 and its DFS forest, each vertex labeled with its (start, finish) interval; the intervals shown are (1,8), (2,5), (3,4), (6,7), (9,14), (10,13), (11,12).]

[Figure: DFS forest of G^r; nodes are considered by decreasing finish times.]

Still need to prove Lemma 6

Recall that for an SCC C, finish(C) = max_{v ∈ C} finish(v), and that the meta-graph of SCCs is a DAG.

Lemma 6 (restated). Let C_i, C_j be SCCs in G. Suppose there is an edge (u, v) ∈ E such that u ∈ C_i and v ∈ C_j. Then finish(C_i) > finish(C_j).

Proof of Lemma 6

There are two cases to consider:

1. start(u) < start(v) (u is discovered before v)
   - Before leaving u, DFS will explore edge (u, v).
   - Since v ∈ C_j, all of C_j will now be explored.
   - Since there is no path from C_j back to C_i (DAG!), all vertices in C_j will be assigned finish times before DFS backtracks to u and assigns a finish time to u.

   Thus finish(C_j) < finish(u) ≤ finish(C_i).

Proof of Lemma 6 (cont'd)

2. start(u) > start(v) (v is discovered before u)

   Since no vertex of C_i is reachable from C_j, u cannot be discovered while C_j is being explored: DFS finishes all of C_j before u is discovered.

   Thus finish(C_j) < start(u) < finish(u) ≤ finish(C_i), so finish(C_j) < finish(C_i).
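
The four steps of SCC(G) above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the lecture's own code: the graph is assumed to be a dict in which every vertex appears as a key mapped to a list of out-neighbors, and the function names (`finish_order`, `reverse_graph`, `strongly_connected_components`) are hypothetical. The helper `finish_order` returns the vertices by decreasing finish time; on a DAG this is also the order used in "Topological sorting via DFS".

```python
def finish_order(graph):
    """Return all vertices in decreasing order of DFS finish time."""
    visited, order = set(), []

    def visit(u):
        visited.add(u)
        for v in graph[u]:
            if v not in visited:
                visit(v)
        order.append(u)  # u finishes after everything explored from it

    for s in graph:
        if s not in visited:
            visit(s)
    return order[::-1]


def reverse_graph(graph):
    """Build G^r: every edge (i, j) becomes (j, i)."""
    rev = {u: [] for u in graph}
    for u, neighbors in graph.items():
        for v in neighbors:
            rev[v].append(u)
    return rev


def strongly_connected_components(graph):
    """Return the SCCs of graph as a list of vertex lists."""
    rev = reverse_graph(graph)   # step 1
    order = finish_order(rev)    # step 2: finish times in G^r
    visited, components = set(), []
    for s in order:              # step 3: search G by decreasing finish
        if s in visited:
            continue
        # Explicit-stack search from s; it collects the same vertex
        # set as the DFS tree rooted at s.
        tree, stack = [], [s]
        visited.add(s)
        while stack:
            u = stack.pop()
            tree.append(u)
            for v in graph[u]:
                if v not in visited:
                    visited.add(v)
                    stack.append(v)
        components.append(tree)  # step 4: each tree is one SCC
    return components


# Small made-up example: a 3-cycle {1, 2, 3} feeding a 2-cycle {4, 5}.
example = {1: [2], 2: [3], 3: [1, 4], 4: [5], 5: [4]}
print(strongly_connected_components(example))  # [[4, 5], [1, 2, 3]]
```

In the example the sink SCC {4, 5} is output first, as the slides predict: the vertex with the largest finish in G^r lies in a sink SCC of G. Each of the three passes touches every vertex and edge a constant number of times, which is one way to see the O(n + m) bound of Remark 1.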