slides9-24 - Columbia University

Algorithms for Data Science
CSOR W4246
Eleni Drinea
Computer Science Department
Columbia University
Thursday, September 24, 2015
Outline
1 Recap
2 Applications of DFS
Cycle detection
Topological sorting
Strongly connected components
Review of the last lecture
1. Applications of BFS
   - Connected components in undirected graphs
   - Testing bipartiteness
2. DFS
   - Classification of graph edges in directed graphs: back,
     forward, cross
   - Time intervals of vertices; identifying the type of an edge
     from the time intervals of its endpoints
Finding your way in a maze
Depth-first search (DFS): starting from a vertex s, explore
the graph as deeply as possible, then backtrack
1. Try the first edge out of s, towards some node v.
2. Continue from v until you reach a dead end, that is, a node
whose neighbors have all been explored.
3. Backtrack to the most recent node with an unexplored neighbor
and repeat step 2.
Remark: DFS answers s-t connectivity
Directed graphs: classification of edges
Graph edges that do not belong to the DFS tree(s) may be
1. forward: from a vertex to a descendant (other than a child)
2. back: from a vertex to an ancestor
3. cross: from right to left (no ancestral relation), that is,
   - from tree to tree, or
   - between nodes in the same tree but on different branches
On the time intervals of vertices u, v
If we use an explicit stack, then
- start(u) is the time when u is pushed onto the stack;
- finish(u) is the time when u is popped from the stack
  (that is, all of its neighbors have been explored).
Intervals [start(u), finish(u)] and [start(v), finish(v)] either
- are nested, one containing the other (u is an ancestor of v
  or vice versa); or
- are disjoint.
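A minimal sketch of how these timestamps could be computed with an explicit stack. The function name dfs_times and the dict-of-lists graph representation are assumptions made for illustration; every vertex is assumed to appear as a key.

```python
def dfs_times(graph):
    """Iterative DFS over a dict {vertex: [out-neighbors]}.

    Returns (start, finish): start[u] is the clock value when u is pushed
    onto the stack, finish[u] the clock value when u is popped.
    """
    start, finish = {}, {}
    clock = 0
    for root in graph:
        if root in start:
            continue  # already discovered from an earlier root
        clock += 1
        start[root] = clock
        # Each stack entry keeps an iterator so we resume where we left off.
        stack = [(root, iter(graph[root]))]
        while stack:
            u, neighbors = stack[-1]
            for v in neighbors:
                if v not in start:
                    clock += 1
                    start[v] = clock
                    stack.append((v, iter(graph[v])))
                    break  # go deeper before looking at u's other neighbors
            else:
                # All neighbors of u have been explored: pop u.
                stack.pop()
                clock += 1
                finish[u] = clock
    return start, finish
```

On any input, two of the resulting intervals are either nested or disjoint, which is the property stated above.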
Classifying edges using time
1. Edge (u, v) ∈ E is a back edge in a DFS tree if and only if
start(v) < start(u) < finish(u) < finish(v).
2. Edge (u, v) ∈ E is a forward edge if
start(u) < start(v) < finish(v) < finish(u).
3. Edge (u, v) ∈ E is a cross edge if
start(v) < finish(v) < start(u) < finish(u).
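The three rules can be sketched as a small function. The name classify_edge and the dict-based start/finish arguments are assumptions for illustration. Tree edges satisfy the same inequalities as forward edges, so the sketch reports them together, and a self-loop is treated as a back edge.

```python
def classify_edge(u, v, start, finish):
    """Classify edge (u, v) from the DFS time intervals of its endpoints."""
    if start[u] < start[v] and finish[v] < finish[u]:
        # v's interval lies inside u's: v is a descendant of u.
        return "tree or forward"
    if start[v] <= start[u] and finish[u] <= finish[v]:
        # u's interval lies inside v's (or u == v): v is an ancestor of u.
        return "back"
    if finish[v] < start[u]:
        # v was fully explored before u was discovered.
        return "cross"
    # finish(u) < start(v) cannot happen for an edge (u, v): DFS would have
    # discovered v through this edge before popping u.
    raise ValueError("intervals are not consistent with a DFS of this edge")
```

For example, with the hypothetical intervals 1:(1,8), 2:(2,5), 3:(3,4), 4:(6,7), an edge (3, 1) is classified as a back edge and an edge (4, 2) as a cross edge.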
Application I: Cycle detection
Claim 1.
G = (V, E) has a cycle if and only if DFS(G) yields a back edge.
Proof.
If (u, v) is a back edge, together with the path on the DFS tree
from v to u, it forms a cycle.
Conversely, suppose G has a cycle. Let v be the first vertex
from the cycle discovered by DFS(G). Let (u, v) be the
preceding edge in the cycle. Since there is a path from v to
every vertex in the cycle, all vertices in the cycle are
discovered and fully explored before v is popped from the
stack. Hence the interval of u is contained in the interval of v.
By the interval characterization of back edges, (u, v) is a back edge.
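Claim 1 suggests a direct test for cycles: run DFS and report a cycle as soon as an edge leads to a vertex that is still on the stack (an ancestor). The sketch below is one possible way to do this; the name has_cycle, the three-color bookkeeping, and the dict-of-lists input are illustrative assumptions.

```python
def has_cycle(graph):
    """Return True iff the directed graph {vertex: [out-neighbors]} has a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2  # undiscovered, on the stack, fully explored
    color = {u: WHITE for u in graph}
    for root in graph:
        if color[root] != WHITE:
            continue
        color[root] = GRAY
        stack = [(root, iter(graph[root]))]
        while stack:
            u, neighbors = stack[-1]
            for v in neighbors:
                if color[v] == GRAY:
                    return True  # (u, v) is a back edge: v is an ancestor of u
                if color[v] == WHITE:
                    color[v] = GRAY
                    stack.append((v, iter(graph[v])))
                    break
            else:
                color[u] = BLACK  # all neighbors explored: pop u
                stack.pop()
    return False
```

Each vertex is pushed once and each edge examined once, so the check runs in O(n + m) time.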
Application II: Topological sorting in DAGs
- An undirected acyclic graph has an extremely simple
  structure: it is a forest (a tree if connected), hence a
  sparse graph (O(n) edges).
- A directed acyclic graph (DAG) may be dense (Ω(n^2)
  edges): e.g., V = {1, ..., n}, E = {(i, j) : i < j}.
[Figure: a directed graph on vertices 1, 2, 3, 4.]
Topological sorting: motivation
Input:
- a set of tasks {1, 2, ..., n} that need to be performed
- a set of dependencies, each of the form (i, j), indicating
  that task i must be performed before task j.
Output: a valid order in which the tasks may be performed, so
that all dependencies are respected.
Example: tasks are courses and certain courses must be taken
before others.
How can we model this problem using a graph? What kind of
graph must arise and why?
Topological ordering: definition
Definition 1.
A topological ordering of G is an ordering of its nodes as
1, 2, . . . , n such that for every edge (i, j), we have i < j.
- All edges point forward in the topological ordering.
- It provides an order in which all tasks can be safely
  performed: when we try to perform task j, all tasks
  required to precede it have already been done.
Example of DAG and its topological sorting
[Figure: a DAG (top left), its topological sort (top right), and a drawing emphasizing the topological sort (bottom).]
Topological sorting in DAGs
Claim 2.
If G has a topological ordering, then G is a DAG.
Proof: By contradiction (exercise).
A visualization of the proof is provided by the linearized graph
of the previous slide: vertices appear in increasing order, edges
go from left to right, hence no cycles.
Is the converse true: does every DAG have a topological
ordering? And how can we find it?
Structural properties of DAGs
In a DAG, can every vertex have
- an outgoing edge?
- an incoming edge?
Definition 2 (source and sink).
A source is a node with no incoming edges.
A sink is a node with no outgoing edges.
Fact 3.
Every DAG has at least one source and at least one sink.
How can we use Fact 3 to find a topological order?
The node that we label first in the topological sorting must have
no incoming edges. Fact 3 guarantees that such a node exists.
Fact 4.
Let G' be the graph after a source node and its incident edges
have been removed. Then G' is a DAG.
Proof: removing edges from G cannot yield a cycle!
This gives rise to a recursive algorithm for finding the
topological order of a DAG. Its correctness can be shown by
induction (use Facts 3, 4 to show induction step).
Algorithm for topological sorting
TopologicalOrder(G)
1. Find a source vertex s and order it first.
2. Delete s and its incident edges from G; let G' be the new
graph.
3. TopologicalOrder(G')
4. Append the order found after s.
Running time: O(n^2). Can be improved to O(n + m).
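One common way to reach O(n + m) is to keep in-degree counters instead of deleting vertices and rescanning for a source. The sketch below follows that idea; it is an iterative variant of the recursive procedure above, not necessarily the improvement the lecture has in mind. The name topological_order and the dict-of-lists input are assumptions.

```python
from collections import deque


def topological_order(graph):
    """Topological order of a DAG given as {vertex: [out-neighbors]}.

    Raises ValueError if the graph contains a cycle.
    """
    indegree = {u: 0 for u in graph}
    for u in graph:
        for v in graph[u]:
            indegree[v] += 1
    # Current sources: vertices with no remaining incoming edges.
    sources = deque(u for u in graph if indegree[u] == 0)
    order = []
    while sources:
        s = sources.popleft()
        order.append(s)
        # "Delete" s by decrementing the in-degree of its out-neighbors.
        for v in graph[s]:
            indegree[v] -= 1
            if indegree[v] == 0:
                sources.append(v)
    if len(order) != len(graph):
        raise ValueError("graph has a cycle, so no topological order exists")
    return order
```

Every vertex enters the queue at most once and every edge is decremented once, which gives the O(n + m) bound.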
Topological sorting via DFS
Let G = (V, E) be a DAG.
- Run DFS(G); compute finish times.
- Process the tasks in decreasing order of finish times.
Running time: O(m + n)
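A minimal sketch of this procedure, assuming the input is a DAG given as a dict of adjacency lists (the name topological_order_dfs is hypothetical). Instead of storing numeric finish times, it records vertices in the order they finish and reverses that list.

```python
def topological_order_dfs(graph):
    """Return the vertices of a DAG in decreasing order of DFS finish time."""
    visited = set()
    by_finish = []  # vertices in increasing order of finish time
    for root in graph:
        if root in visited:
            continue
        visited.add(root)
        stack = [(root, iter(graph[root]))]
        while stack:
            u, neighbors = stack[-1]
            for v in neighbors:
                if v not in visited:
                    visited.add(v)
                    stack.append((v, iter(graph[v])))
                    break
            else:
                stack.pop()          # u is finished
                by_finish.append(u)
    return by_finish[::-1]           # decreasing finish time
```

The sketch does not check for cycles; on a graph with a cycle it still returns a list, but that list is not a topological order.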
Intuition behind this algorithm
- The task v with the largest finish has no incoming edges
  (if it had an incoming edge from some other task u, then u
  would have the largest finish). Hence v does not depend
  on any other task and it is safe to perform it first.
- The same reasoning shows that the task w with the second
  largest finish has no incoming edges from any other task
  except (maybe) task v. Hence it is safe to perform w
  second.
- And so on and so forth.
Formal proof of correctness
By Claim 1 there are no back edges in the DFS forest of a
DAG. Thus every edge (u, v) ∈ E is either
1. forward/tree: start(u) < start(v) < finish(v) < finish(u)
2. or cross edge: finish(v) < start(u) < finish(u)
Proof of correctness (cont’d)
Hence for every (u, v) ∈ E, finish(v) < finish(u).
Consider a task v. All tasks u upon which v depends, that is,
all tasks u such that there is an edge (u, v) ∈ E, satisfy
finish(v) < finish(u).
Since we are processing tasks in decreasing order of finish times,
all tasks u upon which v depends have already been processed
before we start processing v.
Exploring the connectivity of a graph
- Undirected graphs: find all connected components
- Directed graphs: find all strongly connected components
  (SCCs)
- SCC(u) = set of nodes that are reachable from u and have
  a path back to u
- SCCs provide a hierarchical view of the connectivity of the
  graph:
  - on a top level, the meta-graph of SCCs has a useful and
    simple structure (coming up);
  - each meta-vertex of this graph is a strongly connected
    subgraph that we can further explore.
How can we find SCC(u) using BFS?
1. Run BFS(u); the resulting tree T consists of the set of
nodes to which there is a path from u.
2. Define G^r as the reverse graph, where edge (i, j) becomes
edge (j, i).
3. Run BFS(u) in G^r; the resulting BFS tree T' consists of the
set of nodes that have a path to u.
4. The common vertices of T and T' compose the strongly
connected component of u.
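The four steps above can be sketched as follows. The helper names reachable, reverse_graph and scc_of are hypothetical, and the graph is assumed to be a dict mapping every vertex to its list of out-neighbors.

```python
from collections import deque


def reachable(graph, source):
    """Set of vertices reachable from source, found by BFS."""
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def reverse_graph(graph):
    """Reverse graph: every edge (i, j) becomes (j, i)."""
    reversed_graph = {u: [] for u in graph}
    for u in graph:
        for v in graph[u]:
            reversed_graph[v].append(u)
    return reversed_graph


def scc_of(graph, u):
    """Vertices that u can reach and that can reach u."""
    return reachable(graph, u) & reachable(reverse_graph(graph), u)
```

Each call costs O(n + m), so repeating it from a fresh vertex for every component can take O(n(n + m)) time in the worst case, which motivates looking for a faster way to find all SCCs.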
What if we want all the SCCs of the graph?
The meta-graph of SCCs of a directed graph
[Figure: a directed graph on vertices 1 through 7.]
Consider the meta-graph of all SCCs of G.
- Make a (super)vertex for every SCC.
- Add a (super)edge from SCC Ci to SCC Cj if there is an
  edge from some vertex u of Ci to some vertex v of Cj.
What kind of graph is the meta-graph of SCC’s?
[Figure: the graph with its SCCs labeled C1, C2, C3.]
This graph is a DAG.
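A minimal sketch of the meta-graph construction, assuming the SCCs are already known and given as a map component_of from each vertex to a label for its SCC. Both the function name meta_graph and that input format are assumptions for illustration.

```python
def meta_graph(graph, component_of):
    """Build the meta-graph of SCCs.

    graph: {vertex: [out-neighbors]}
    component_of: {vertex: label of the SCC containing it}
    Returns {SCC label: set of SCC labels it has a (super)edge to}.
    """
    meta = {c: set() for c in set(component_of.values())}
    for u in graph:
        for v in graph[u]:
            cu, cv = component_of[u], component_of[v]
            if cu != cv:
                # An edge between different SCCs becomes a (super)edge;
                # using a set merges parallel (super)edges.
                meta[cu].add(cv)
    return meta
```

If the labels really are the SCCs of the graph, the result has no cycles: a cycle through two different SCCs would make their vertices mutually reachable, so they would have been a single SCC.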
Is there an SCC we could process first?
Suppose we had a sink SCC of G, that is, an SCC with no
outgoing edges.
1. What will DFS discover starting at a node of a sink SCC?
2. How do we find a node that for sure lies in a sink SCC?
3. How do we continue to find all other SCCs?
Easier to find a node in a source SCC!
Fact 5.
The node assigned the largest finish time when we run DFS(G)
belongs to a source SCC in G.
Example: v5 belongs to source SCC C2.
Proof.
We will use Lemma 6 below. Let G be a directed graph. The
meta-graph of its SCCs is a DAG. For an SCC C, let
finish(C) = max_{v ∈ C} finish(v)
Example: finish(C1) = finish(v1) = 8.
Lemma 6.
Let Ci, Cj be SCCs in G. Suppose there is an edge (u, v) ∈ E
such that u ∈ Ci and v ∈ Cj. Then finish(Ci) > finish(Cj).
By Lemma 6, the SCC with the largest finish cannot have an
incoming edge from another SCC, hence it is a source SCC.
G^r is useful again
- Fact 5 provides a direct way to find a node in a source SCC
  of G: pick the node with largest finish.
- But we want a node in a sink SCC of G!
- Consider G^r, the graph where the edges of G are reversed.
  How do the SCCs of G and G^r compare?
- Run DFS on G^r: the node with the largest finish comes
  from a source SCC of G^r (Fact 5). This is a sink SCC of G!
Using this observation to find all SCCs
We now know how to find a sink SCC in G.
1. Run DFS(G^r); compute finish times.
2. Run DFS(G) starting from the node with the largest finish:
the nodes in the resulting tree T form a sink SCC in G.
How do we find all remaining SCCs?
- Remove T from G; let G' be the resulting graph.
- The meta-graph of SCCs of G' is a DAG, hence it has at
  least one sink SCC.
- Apply the procedure above recursively on G'.
Algorithm for finding SCCs in directed graphs
SCC(G = (V, E))
1. Compute G^r.
2. Run DFS(G^r); compute finish(u) for all u.
3. Run DFS(G), picking start vertices in decreasing order of
finish(u).
4. Output the vertices of each tree in the DFS forest of line 3
as an SCC.
Remark 1.
1. Running time: O(n + m). Why?
2. Equivalently, we can (i) run DFS(G), compute finish times;
(ii) run DFS(G^r) by decreasing order of finish. Why?
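The four lines of SCC(G) can be sketched as below, assuming a dict-of-lists graph in which every vertex appears as a key. The helper names reverse_graph, dfs_forest and strongly_connected_components are hypothetical. Rather than storing numeric finish times, the sketch records the order in which vertices finish.

```python
def reverse_graph(graph):
    """Reverse graph: every edge (i, j) becomes (j, i)."""
    reversed_graph = {u: [] for u in graph}
    for u in graph:
        for v in graph[u]:
            reversed_graph[v].append(u)
    return reversed_graph


def dfs_forest(graph, roots):
    """Run DFS, trying the given roots in order.

    Returns one list per DFS tree; each list holds the tree's vertices
    in increasing order of finish time.
    """
    visited = set()
    forest = []
    for root in roots:
        if root in visited:
            continue
        visited.add(root)
        tree = []
        stack = [(root, iter(graph[root]))]
        while stack:
            u, neighbors = stack[-1]
            for v in neighbors:
                if v not in visited:
                    visited.add(v)
                    stack.append((v, iter(graph[v])))
                    break
            else:
                stack.pop()
                tree.append(u)  # u is finished
        forest.append(tree)
    return forest


def strongly_connected_components(graph):
    # Lines 1-2: DFS on the reverse graph, recording the finish order.
    reversed_graph = reverse_graph(graph)
    finish_order = [u for tree in dfs_forest(reversed_graph, list(graph))
                    for u in tree]
    # Line 3: DFS on G, taking roots in decreasing order of finish time.
    forest = dfs_forest(graph, reversed(finish_order))
    # Line 4: each tree of this forest is one SCC.
    return [set(tree) for tree in forest]
```

Building the reverse graph and running the two DFS passes each take O(n + m) time. The components come out sink-first with respect to the meta-graph of G.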
A directed graph and its DFS forest with time intervals
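[Figure: a directed graph on vertices 1 through 7 and its DFS forest. Each vertex is labeled with its (start, finish) interval:]
1 (1,8), 2 (2,5), 3 (3,4), 4 (6,7), 5 (9,14), 6 (10,13), 7 (11,12)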
[Figure: DFS forest of G^r; nodes are considered by decreasing finish times. Labels shown in the figure: v5 (14), v7 (13), v6 (12), v1 (8), v4 (7), v2 (5), v3 (4).]
Still need to prove Lemma 6
Let G be a directed graph. The meta-graph of its SCCs is a
DAG.
For an SCC C, let
finish(C) = max_{v ∈ C} finish(v)
Lemma 6.
Let Ci, Cj be SCCs in G. Suppose there is an edge (u, v) ∈ E
such that u ∈ Ci and v ∈ Cj. Then finish(Ci) > finish(Cj).
Proof of Lemma 6
There are two cases to consider:
1. start(u) < start(v) (DFS starts at Ci)
   - Before leaving u, DFS will explore edge (u, v).
   - Since v ∈ Cj, all of Cj will now be explored.
   - Since there is no edge from Cj back to Ci (DAG!), all
     vertices in Cj will be assigned finish times before DFS
     backtracks to u and assigns a finish time to u. Thus
     finish(Cj) < finish(u) ≤ finish(Ci)
Proof of Lemma 6 (cont’d)
2. start(u) > start(v) (DFS starts at Cj)
Since there is no path from Cj to Ci, DFS will finish
exploring Cj before it restarts from some vertex that will
result in discovery of Ci. Thus
finish(Cj) < start(u) < finish(u)
⇒ finish(Cj) < finish(Ci)