November 30, 2014

Graphs 1: Representation, Search, and Traverse

This notes chapter begins with a review of graph and digraph concepts and terminology and introduces the adjacency matrix and adjacency list representation systems. Then we study the depth-first and breadth-first search processes and elaborations on them, including DFS and BFS traversals and the ultimate Survey algorithms.

Outside the Box:
  Best Lecture Slides: PU
  Best Video Demos:    UCB

Demos:
  DFSurvey (undirected)   LIB/area51/fdfs   ug.x
  DFSurvey (directed)     LIB/area51/fdfs   dg.x
  BFSurvey (undirected)   LIB/area51/fbfs   ug.x
  BFSurvey (directed)     LIB/area51/fbfs   dg.x

Textbook deconstruction. This notes chapter corresponds roughly to Chapter 22 in [Cormen et al 3e], which covers graph (and digraph) concepts and representations and the basic depth- and breadth-first search processes. The notation G = (V, E) is common to the text, the video linked here, and these notes.

1 Definitions, Notation, and Representations

Graphs and directed graphs are studied extensively in the prerequisite math sequence Discrete Mathematics I, II [FSU::MAD 2104 and FSU::MAD 3105], so these notes will not dwell on details. We will however mention key concepts in order to establish notation and recall familiarity. Where there may be slight variations in terminology, we side with the textbook (see [Cormen::Appendix B.4]). Assume that G = (V, E) is a graph (directed or undirected).

1.1 Undirected Graphs - Theory and Terminology

• Undirected Graph: aka Ungraph, Bidirectional Graph, Bigraph; Graph
• Vertices, vSize = |V| = the number of vertices of G
• Edges, eSize = |E| = the number of edges of G
• Adjacency: Vertices x and y are said to be adjacent iff [x, y] ∈ E.
• Path from v to w: a set {x0, ..., xk} of vertices such that x0 = v, xk = w, and [xi−1, xi] is an edge for i = 1, ..., k. The edges [xi−1, xi] are called the edges of the path and k is the length of the path. A path is simple if the vertices defining it are distinct - that is, not repeated anywhere along the path. If p is a path from v to w we say w is reachable from v.
• Connected Graph: for all vertices v, w ∈ V, w is reachable from v.
• Cycle: A path of length at least 3 from a vertex to itself.
• Acyclic Graph: Graph with no cycles.
• Degree of a vertex v: deg(v) = the number of edges with one end at v.

1.2 Directed Graphs - Theory and Terminology

• Directed Graph: aka Digraph
• Vertices, vSize = |V| = the number of vertices of G
• Edges, eSize = |E| = the number of (directed) edges of G
• Adjacency: Vertex x is adjacent to vertex y, and y is adjacent from x, iff (x, y) ∈ E.
• Path from v to w: a set {x0, ..., xk} of vertices such that x0 = v, xk = w, and (xi−1, xi) is an edge for i = 1, ..., k. The edges (xi−1, xi) are called the edges of the path and k is the length of the path. A path is simple if the vertices defining it are distinct - that is, not repeated anywhere along the path. If p is a path from v to w we say w is reachable from v. Note that these edges are all directed and must orient in the forward direction.
• Strongly Connected Digraph: for all vertices v, w ∈ V, w is reachable from v. Note this means the existence of two directed paths - one from v to w and the other from w to v. Also note that putting the two paths together creates a (directed) cycle containing both v and w.
• (Directed) Cycle: A path of length at least 1 from a vertex to itself.
  Note this is distinct from the undirected case, allowing paths of length 1 (so-called self-loops) and length 2 (essentially an out-and-back path, equivalent to a two-way edge).
• Acyclic Digraph: Digraph with no cycles; DAG
• inDegree of a vertex v: inDeg(v) = the number of edges to v.
• outDegree of a vertex v: outDeg(v) = the number of edges from v.

There are subtle distinctions in the way graphs and digraphs are treated in definitions. There are also "conversions" from graph to digraph and digraph to graph.

• Undirected graph edges have no preferred direction and are represented using the notation [x, y] to mean the un-ordered pair in which [x, y] = [y, x]. Directed graph edges have a preferred direction and are represented using the notation (x, y) to mean the ordered pair in which (x, y) ≠ (y, x). (In practice, the notation [x, y] is often dropped when context makes it clear whether the edge is directed or undirected or when both directed and undirected graphs are considered in the same context.)

• In an undirected or directed graph, vertex y is said to be a neighbor of the vertex x iff (x, y) ∈ E. In undirected graphs, being neighbors and being adjacent are equivalent. In digraphs, y is a neighbor of x iff y is adjacent from x. (Note this terminology differs from that in [Cormen et al 3e].)

• If D = (V, E) is a digraph, the undirected version of D is the graph D′ = (V, E′) where [x, y] ∈ E′ iff x ≠ y and (x, y) ∈ E. Note this replaces, in effect, a 1-way edge with a 2-way edge (or two opposite 1-way edges in the representations), except self-loops are eliminated.

• If G = (V, E) is a graph, the directed version of G is the graph G′ = (V, E′) where (x, y) ∈ E′ iff [x, y] ∈ E. Note this replaces each undirected edge [x, y] with two opposing directed edges (x, y) and (y, x).

Theorem 1u. In an undirected graph, Σv∈V deg(v) = 2 × |E|.

Theorem 1d. In a directed graph, Σv∈V inDeg(v) = Σv∈V outDeg(v) = |E|.

The proof of Theorem 1 uses aggregate analysis. First show the result for directed graphs, where edges are in 1-1 correspondence with their initiating vertices. Then apply 1d to the directed version of an undirected graph and adjust for the double counting.

1.3 Graph and Digraph Representations

There are two fundamental families of representation methods for graphs - connectivity matrices and adjacency sets. Two archetypical examples apply to graphs and digraphs G = (V, E) with the assumption that the vertices are numbered starting with 0 (in C fashion): V = {v0, v1, v2, ..., vn−1}.

1.3.1 Adjacency Matrix Representation

In the adjacency matrix representation we define an n × n matrix M by

  M(i, j) = 1  if G has an edge from vi to vj
  M(i, j) = 0  otherwise

M is the adjacency matrix of G. Consider the graph G1 depicted here:

  Graph G1

  0 --- 1     2 --- 6 --- 7
  |           |           |
  3 --- 4 --- 5 --- 8 --- 9

The adjacency matrix for G1 is:

  M =
    − 1 − 1 − − − − − −
    1 − − − − − − − − −
    − − − − − 1 1 − − −
    1 − − − 1 − − − − −
    − − − 1 − 1 − − − −
    − − 1 − 1 − − − 1 −
    − − 1 − − − − 1 − −
    − − − − − − 1 − − 1
    − − − − − 1 − − − 1
    − − − − − − − 1 1 −

(where we have used "−" instead of "0" for readability).

Adjacency matrix representations are very handy for some graph contexts. They are easy to define, low in overhead, and there is not a lot of data structure programming involved. And some questions have very quick answers:

1. Using an adjacency matrix representation, the question "Is there an edge from vi to vj?" is answered by interpreting M(i, j) as boolean, and hence in time Θ(1).
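To make the representation concrete, here is a minimal, self-contained C++ sketch (names invented for illustration; this is not the course library code) that stores the matrix as a vector of vectors and answers the edge question in Θ(1). It builds G1 from its edge list and queries two pairs:

#include <cstddef>
#include <iostream>
#include <vector>

// Minimal adjacency matrix sketch (illustration only, not the fsu library classes).
class MatrixGraph
{
public:
  explicit MatrixGraph (std::size_t n) : m_(n, std::vector<char>(n, 0)) {}
  void AddEdge (std::size_t x, std::size_t y) { m_[x][y] = 1; m_[y][x] = 1; } // undirected
  bool HasEdge (std::size_t x, std::size_t y) const { return m_[x][y] != 0; } // Theta(1) query
private:
  std::vector< std::vector<char> > m_;
};

int main()
{
  MatrixGraph g(10);                               // vertices 0..9 of G1
  std::size_t e[10][2] = {{0,1},{0,3},{2,5},{2,6},{3,4},
                          {4,5},{5,8},{6,7},{7,9},{8,9}};
  for (int i = 0; i < 10; ++i) g.AddEdge(e[i][0], e[i][1]);
  std::cout << g.HasEdge(4,5) << ' ' << g.HasEdge(4,6) << '\n';  // prints: 1 0
}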
Adjacency matrix representations suffer some disadvantages as well, related to the fact that an n × n matrix has Θ(n × n) places to store and visit:

2. An adjacency matrix representation requires Θ(n²) storage.

3. A loop touching all edges of a graph (or digraph) represented with an adjacency matrix requires Ω(n²) time.

Particularly when graphs are sparse - a somewhat loose term implying that the number of edges is much smaller than it might be - these basic storage and traversal tasks should not be so expensive.

1.3.2 Sparse Graphs

A graph (or digraph) could in principle have an edge from any vertex to any other.[1] In other words the adjacency matrix representation might have 1's almost anywhere, out of a total of Θ(n²) possibilities. In practice, most large graphs encountered in real life are quite sparse, with vertex degree significantly limited, so that the number of edges is O(n) and the degree of each vertex may even be O(1).[2] Observe:

4. Suppose d is an upper bound on the degree of the vertices of a graph (or digraph). Then d × n is an upper bound on the number of edges in the graph.

So, a graph with, say, 10,000 vertices, and vertex degree bounded by 100, would have no more than 1,000,000 edges, whereas the adjacency matrix would have storage for 100,000,000, with 99,000,000 of these representing "no edge".[3]

[1] We usually don't allow self-edges in undirected graphs unless the context specifically warns that self-edges are allowed.
[2] For example, Euler's Formula on planar graphs implies that if G is planar then |E| ≤ 3|V| − 6, hence |E| = O(|V|).
[3] The human brain can be modeled loosely as a directed graph with vertices representing neuronal cells and edges representing direct communication from one neuron to another. This digraph is estimated to have about 10^10 vertices with average degree about 10^3.

1.3.3 Adjacency List Representation

The adjacency list representation is designed to require storage to represent edges only when they actually exist in the graph. There is per-edge overhead required, but for sparse graphs this overhead is negligible compared to the space used by adjacency matrices to store "no info". The adjacency list uses a vector v of lists reminiscent of the supporting structure for hash tables. The vector is indexed 0, ..., n − 1, and the list v[i] consists of the subscripts of vertices that are adjacent from vertex vi. For example, the adjacency list representation of the graph G1 illustrated above is:

  v[0]: 1 , 3          Graph G1
  v[1]: 0
  v[2]: 5 , 6          0 --- 1     2 --- 6 --- 7
  v[3]: 0 , 4          |           |           |
  v[4]: 3 , 5          3 --- 4 --- 5 --- 8 --- 9
  v[5]: 2 , 4 , 8
  v[6]: 2 , 7
  v[7]: 6 , 9
  v[8]: 5 , 9
  v[9]: 7 , 8

As with the matrix representation above, time and space requirements for adjacency list representations are based on what needs to be traversed and stored:

5. An adjacency list representation requires Θ(|V| + |E|) storage.

6. A loop touching all edges of a graph (or digraph) represented with an adjacency list representation requires Ω(|V| + |E|) time.

Proofs of the last two factoids use aggregate analysis. Note that when G has many edges, for example when |E| = Θ(|V|²), the estimates are the same as for matrix representations. But for sparse graphs, in particular when |E| = O(|V|), the estimates are dramatically better - linear in the number of vertices.
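The structure just described translates almost directly into code. Here is a minimal, self-contained sketch (invented names; standard library containers standing in for the fsu containers) of an undirected adjacency list with Θ(|V| + |E|) storage:

#include <cstddef>
#include <iostream>
#include <list>
#include <vector>

// Minimal adjacency list sketch (illustration only): a vector of lists,
// v_[i] holding the vertices adjacent from vertex i.
class ListGraph
{
public:
  explicit ListGraph (std::size_t n) : v_(n) {}
  void AddEdge (std::size_t x, std::size_t y)            // undirected: store both directions
  { v_[x].push_back(y); v_[y].push_back(x); }
  const std::list<std::size_t>& Adj (std::size_t x) const { return v_[x]; }
private:
  std::vector< std::list<std::size_t> > v_;              // storage Theta(|V| + |E|)
};

int main()
{
  ListGraph g(10);                                       // graph G1
  std::size_t e[10][2] = {{0,1},{0,3},{2,5},{2,6},{3,4},
                          {4,5},{5,8},{6,7},{7,9},{8,9}};
  for (std::size_t i = 0; i < 10; ++i) g.AddEdge(e[i][0], e[i][1]);
  for (std::size_t x : g.Adj(5)) std::cout << x << ' ';  // prints: 2 4 8
  std::cout << '\n';
}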
1.3.4 Undirected v. Directed Graphs

Both the adjacency matrix and adjacency list can represent either directed or undirected graphs. And in both systems, an undirected edge has a somewhat redundant representation.

An adjacency matrix M represents an undirected graph when it is symmetric about the main diagonal:

  M(i, j) = M(j, i) for all i, j

One could think of this constraint as a test whether the underlying graph is directed or not.

An adjacency list v[] represents an undirected graph when all edges (x, y) appear twice in the lists - once making y adjacent from x and once making x adjacent from y:

  ...
  v[x]: ... , y , ...
  ...
  v[y]: ... , x , ...
  ...

1.3.5 Variations on Adjacency Lists

Many graph algorithms require examining all vertices adjacent from a given vertex. For such algorithms, traversals of the adjacency lists will be required. These traversals are inherently Ω(list.size), and replacing list with a faster access time set structure might even slow down the traversal (although not affecting its asymptotic runtime). On the other hand, unlike the situation with hash tables where we could choose the size of the vector to hold the average list size to approximately 2, we have no such luxury in representing graphs. If a graph process requires many direct queries of the form "is y adjacent from x" (as distinct from traversing the entire adjacency list), a question that requires Ω(d) time to answer using sequential search in a list of size d, it can be advantageous to replace list with a faster access set structure, which would allow the question to be answered in time O(log d). (In the graph context, d is the size of the adjacency list and the outDegree of the vertex whose adjacency list is being searched.)

1.3.6 Dealing with Graph Data

Often applications require maintaining data associated with vertices and/or edges. Vertex data is easily maintained using auxiliary vectors. Edge data in an adjacency matrix representation is also straightforward to maintain in another matrix. Edge data is slightly more difficult to maintain in an adjacency list representation, in essence because the edges are only implied by the representation. This problem can be handled in two ways.

Adjacency lists can be replaced with edge lists - instead of listing adjacent vertices, list pointers to the edges that connect the adjacent pairs. Then an edge can be as elaborate as needed. It must at minimum know what its two vertex ends are, of course, something like this:

template <class T>
struct Edge
{
  T data_;              // whatever needs to be stored
  unsigned from_, to_;  // vertices of edge
};

Or a hash table can be set up to store edge data based on keys of the form (x,y), where (x,y) is the edge implied by finding y in the list v[x].
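The second option is easy to sketch with standard library containers. The following illustration keeps edge weights in a hash table keyed on the ordered pair (x,y); the names and the hash combiner are invented for demonstration and are not a library recommendation:

#include <iostream>
#include <unordered_map>
#include <utility>

// Sketch: edge data stored outside the adjacency list, keyed on the pair (x,y).
struct PairHash
{
  std::size_t operator() (const std::pair<unsigned,unsigned>& p) const
  {
    return std::hash<unsigned>()(p.first) * 31u + std::hash<unsigned>()(p.second);
  }
};

int main()
{
  std::unordered_map< std::pair<unsigned,unsigned>, double, PairHash > weight;
  weight[{4,5}] = 2.5;                 // data for edge (4,5)
  weight[{5,4}] = 2.5;                 // store both orientations for an undirected edge
  std::cout << weight[{4,5}] << '\n';  // prints: 2.5
}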
1.4 Graph Classes

We now have four distinct representations to enshrine in code: adjacency matrix and adjacency list representations for both undirected and directed graphs. The following class hierarchy provides a plan for defining these four representations while enforcing terminology uniformity and re-using code by elevating to a parent class where appropriate:

Graph                // abstract base class
  GraphMatrixBase    // base class for adj matrix reps
    UnGraphMatrix    // undirected adj matrix representation
    DiGraphMatrix    // directed adj matrix representation
  GraphListBase      // base class for adj list reps
    UnGraphList      // undirected adj list representation
    DiGraphList      // directed adj list representation

The following pseudo-code provides a plan for implementation:

class Graph  // abstract base class for all representations
{
  typedef unsigned Vertex;
public:
  virtual void     SetVrtxSize (unsigned n) = 0;
  virtual void     AddEdge     (Vertex from, Vertex to) = 0;
  virtual bool     HasEdge     () const;
  virtual unsigned VrtxSize    () const;
  virtual unsigned EdgeSize    () const;
  virtual unsigned OutDegree   (Vertex v) const = 0;
  virtual unsigned InDegree    (Vertex v) const = 0;
  ...
};

class GraphMatrixBase : public Graph
{
  ...
  AdjIterator Begin (Vertex x) const;
  AdjIterator End   (Vertex x) const;
  ...
  fsu::Matrix m_;
};

class GraphListBase : public Graph
{
  ...
  AdjIterator Begin (Vertex x) const;
  AdjIterator End   (Vertex x) const;
  ...
  fsu::Vector < fsu::List < Vertex > > v_;
};

The AdjIterator type needs to be defined for both matrix and list representations. This is a forward iterator traversing the collection of vertices adjacent from v. For the adjacency list representation AdjIterator is a list ConstIterator. Begin can be defined as follows:

AdjIterator GraphListBase::Begin (Vertex x)
{
  return v_[x].Begin();
}

The following implementations should completely clarify the way edges are represented in all four situations:

DiGraphMatrix::AddEdge(Vertex x, Vertex y)
{
  M[x][y] = 1;
}

UnGraphMatrix::AddEdge(Vertex x, Vertex y)
{
  M[x][y] = 1;
  M[y][x] = 1;
}

DiGraphList::AddEdge(Vertex x, Vertex y)
{
  v[x].PushBack(y);
}

UnGraphList::AddEdge(Vertex x, Vertex y)
{
  v[x].PushBack(y);
  v[y].PushBack(x);
}
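Hypothetical client code for the hierarchy might look like the sketch below. Nothing here is implemented yet in these notes - the class names, header name, and default construction are assumptions - but it shows the intended payoff: one build routine works for any of the four representations because it only uses the abstract Graph interface.

// Hypothetical client code: assumes the hierarchy above has been implemented
// and is available through a header (name invented).
#include <iostream>
// #include "graph.h"

void BuildG1 (Graph& g)   // works for any of the four representations
{
  g.SetVrtxSize(10);
  unsigned e[10][2] = {{0,1},{0,3},{2,5},{2,6},{3,4},
                       {4,5},{5,8},{6,7},{7,9},{8,9}};
  for (unsigned i = 0; i < 10; ++i)
    g.AddEdge(e[i][0], e[i][1]);     // directedness is decided by the derived class
}

int main()
{
  UnGraphList ug;  BuildG1(ug);
  DiGraphList dg;  BuildG1(dg);
  std::cout << ug.OutDegree(5) << ' ' << dg.OutDegree(5) << '\n';
  // expected: 3 1 - the undirected class stored each edge in both lists
}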
2 Search

One of the first things we want to do in a graph or digraph is find our way around. There are two widely used and famous processes to perform a search in a graph, both of which have been introduced and used in other contexts: depth-first and breadth-first search. In trees, for example, preorder and postorder traversals follow the depth-first search process, and levelorder traversal follows the breadth-first process. And solutions to maze problems typically use one or the other to construct a solution path from start to goal. Trees and mazes are representable as special kinds of graphs (or digraphs), and that context provides the ultimate generality to study these fundamental algorithms.

We will assume throughout the remainder of this chapter that G = (V, E) is a graph, directed or undirected, with |V| vertices and |E| edges. We also assume that the graph is presented to the search algorithms using the adjacency list representation.

2.1 Breadth-First Search

The Breadth-First Search [BFS] process begins at a vertex of G and explores the graph from that vertex. At any stage in the search, BFS considers all vertices adjacent from the current vertex before proceeding deeper into the graph.

Breadth First Search (v)
Uses: double ended control queue conQ
      vector of bool visited

for each i, visited[i] = false;
conQ.PushBack(v);
visited[v] = true;
PreProcess(0,v);
while (!conQ.Empty())
{
  f = conQ.Front();
  if (n = unvisited adjacent from f)
  {
    conQ.PushBack(n);   // PushFIFO
    visited[n] = true;
    PreProcess(f,n);
  }
  else
  {
    conQ.PopFront();
    PostProcess(f);
  }
}

This statement of the algorithm shows the control structure only, which uses the control queue and visited flags, with other activities gathered under Pre- and Post-processing of vertices and edges. The PreProcess and PostProcess calls may be modified to suit the target purpose of the search. If a particular goal vertex g is sought, PostProcess can check whether n = g and return if true. It is usually also desirable to return a solution path, in which case PreProcess can record the information that f is the predecessor of n in the search.

The runtime of BFS is straightforward to estimate, ignoring the cost of pre and post processing. The aggregate runtime cost breaks down into three categories:

  Total Cost = cost of initializing the visited flags
             + cost of the calls to queue push/pop operations
             + cost of finding the unvisited adjacents.

Initializing the visited flags costs Θ(|V|). There is one push and one pop operation for each edge in the graph that is accessible from v, so that the number of queue operations is no greater than 2|E|. The third term is dependent on the graph representation, which we assume is the adjacency list. Finding the next unvisited vertex adjacent from f requires sequential search of the adjacency list. This search is accomplished using an adjacency iterator that pauses when the next unvisited adjacent is found. The iterator is re-started where it is paused, but never re-initialized. Moreover, a given adjacency list is traversed only one time, when its vertex is at the front of the queue. Therefore the cost of finding unvisited adjacents is the aggregate cost of traversing all of the adjacency lists one time. The aggregate size of adjacency lists is 2|E| for graphs and |E| for digraphs, but also all (reachable) vertices must be tested, so the cost of finding all of the next unvisited adjacents is O(|V| + |E|). Therefore the runtime is bounded above by Θ(|V|) + O(|E|) + O(|V| + |E|):

Theorem 2. The runtime of BFS is O(|V| + |E|).

We cannot conclude the estimate is tight only because not all edges are accessible from a given vertex.

As an example, performing BFS(5) on the graph G1 encounters the vertices as follows:

  adj list rep          Graph G1
  ------------          --------
  v[0]: 1 , 3           0 --- 1     2 --- 6 --- 7
  v[1]: 0               |           |           |
  v[2]: 5 , 6           3 --- 4 --- 5 --- 8 --- 9
  v[3]: 0 , 4
  v[4]: 3 , 5
  v[5]: 2 , 4 , 8
  v[6]: 2 , 7
  v[7]: 6 , 9
  v[8]: 5 , 9
  v[9]: 7 , 8

  BFS::conQ  <--------  (front at left), one snapshot per change:

  null           8 6 3 9
  5              6 3 9
  5 2            6 3 9 7
  5 2 4          3 9 7
  5 2 4 8        3 9 7 0
  2 4 8          9 7 0
  2 4 8 6        7 0
  4 8 6          0
  4 8 6 3        0 1
  8 6 3          1
                 null

  Vertex discovery order: 5 2 4 8 6 3 9 7 0 1
  grouped by distance:    [ (5) (2 4 8) (6 3 9) (7 0) (1) ]
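For readers who want to run the example, here is a self-contained C++ sketch of the BFS control structure above (invented names; std::deque stands in for the control queue, and pre/post-processing is reduced to printing the discovery order). The paused adjacency iterator is modeled with a per-vertex scan position:

#include <cstddef>
#include <deque>
#include <iostream>
#include <vector>

// Stand-alone BFS sketch (illustration only, not the course library code).
void BFS (const std::vector< std::vector<unsigned> >& adj, unsigned v)
{
  std::vector<bool>        visited (adj.size(), false);
  std::vector<std::size_t> next    (adj.size(), 0);    // paused scan position per vertex
  std::deque<unsigned>     conQ;
  conQ.push_back(v);  visited[v] = true;
  std::cout << v << ' ';                               // PreProcess(0,v)
  while (!conQ.empty())
  {
    unsigned f = conQ.front();
    std::size_t pos = next[f];
    while (pos < adj[f].size() && visited[adj[f][pos]]) ++pos;
    next[f] = pos;
    if (pos < adj[f].size())                           // n = unvisited adjacent from f
    {
      unsigned n = adj[f][pos];
      conQ.push_back(n);  visited[n] = true;           // PushFIFO
      std::cout << n << ' ';                           // PreProcess(f,n)
    }
    else
      conQ.pop_front();                                // PostProcess(f) would go here
  }
  std::cout << '\n';
}

int main()
{
  std::vector< std::vector<unsigned> > g1 =            // adjacency lists of G1
    { {1,3}, {0}, {5,6}, {0,4}, {3,5}, {2,4,8}, {2,7}, {6,9}, {5,9}, {7,8} };
  BFS(g1, 5);                                          // prints: 5 2 4 8 6 3 9 7 0 1
}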
2.2 Depth-First Search

Like BFS, the Depth-First Search [DFS] process also begins at a vertex of G and explores the graph from that vertex. In contrast to BFS, which considers all adjacent vertices before proceeding deeper into the graph, DFS follows as deep as possible into the graph before backtracking to an unexplored possibility.

Depth First Search (v)
Uses: double ended control queue conQ
      vector of bool visited

for each i, visited[i] = false;
conQ.PushFront(v);
visited[v] = true;
PreProcess(0,v);
while (!conQ.Empty())
{
  f = conQ.Front();
  if (n = unvisited adjacent from f)
  {
    conQ.PushFront(n);   // PushLIFO
    visited[n] = true;
    PreProcess(f,n);
  }
  else
  {
    conQ.PopFront();
    PostProcess(f);
  }
}

It is remarkable that the only difference between DFS and BFS is in the way unvisited adjacents are placed on the control queue conQ. In DFS, conQ has LIFO behavior, functioning as a control stack. In BFS, conQ has FIFO behavior, functioning as a control queue. The effect is that in DFS, the front of conQ is the previously pushed unvisited adjacent vertex, whereas in BFS, the front of conQ remains the same after the unvisited adjacent vertex has been pushed.

Theorem 3. The runtime of DFS is O(|V| + |E|).

The argument is a repeat of that for Theorem 2. Again we cannot conclude the estimate is tight only because not all edges are accessible from a given vertex.

Continuing with the example G1, here is a trace of DFS(5). (The orientation in the way we show conQ is reversed for DFS: the front is at the right.)

  adj list rep          Graph G1
  ------------          --------
  v[0]: 1 , 3           0 --- 1     2 --- 6 --- 7
  v[1]: 0               |           |           |
  v[2]: 5 , 6           3 --- 4 --- 5 --- 8 --- 9
  v[3]: 0 , 4
  v[4]: 3 , 5
  v[5]: 2 , 4 , 8
  v[6]: 2 , 7
  v[7]: 6 , 9
  v[8]: 5 , 9
  v[9]: 7 , 8

  DFS::conQ  -------->  (front at right), one snapshot per change:

  null             5
  5                5 4
  5 2              5 4 3
  5 2 6            5 4 3 0
  5 2 6 7          5 4 3 0 1
  5 2 6 7 9        5 4 3 0
  5 2 6 7 9 8      5 4 3
  5 2 6 7 9        5 4
  5 2 6 7          5
  5 2 6            null
  5 2

  DFS discovery order: 5 2 6 7 9 8 4 3 0 1    (push order) "preorder"
  DFS finishing order: 8 9 7 6 2 1 0 3 4 5    (pop order)  "postorder"

2.3 Remarks

It is worth noting the memory use patterns for BFS and DFS. In either case, memory use is Θ(|V|) plus the maximum size of the set of gray vertices (those currently in the control queue). In the case of BFS, the gray vertices form a "search frontier" roughly a fixed distance away from the starting vertex v, growing in diameter (and typically size) as it moves away from v. (See Lemma 5 of Section 4.2.) In the case of DFS, the gray vertices represent the current path from v to the most recently discovered vertex. Of course, the sizes of these collections are dependent on the structure of the graph being searched. But typically, the search frontier of BFS can be significantly larger than the longitudinal search path of DFS. This is made quite clear when searching from the root of a binary tree: the BFS frontiers are cross-sections in the tree, approaching |V| in size, whereas the DFS paths are downward paths in the tree, limited to log |V| for a complete tree.

One can think of DFS as the strategy a single person might use to find a goal in a maze, armed with a way to mark locations that have been visited. The person proceeds to an adjacent unvisited vertex as long as there is one. If at some point there is no unvisited adjacent vertex, the person backtracks along the marked path to a vertex that has an unvisited adjacent and then proceeds. Similarly, BFS might be a strategy employed by a search party of many people. At each vertex, the party splits into sub-groups, one for each unvisited adjacent, and the subgroup proceeds to that adjacent. This analogy illustrates two important differences between DFS and BFS:

(1) BFS discovers a most efficient path to any goal, because the various sub-search-parties operate concurrently and proceed away from the starting vertex. The first party to reach the goal sounds the "found" signal.

(2) BFS is more expensive in its use of memory, because there are multiple searches in process concurrently.
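Before moving on to the surveys, here is a stand-alone sketch of the DFS control structure of Section 2.2, under the same assumptions as the BFS sketch in Section 2.1 (invented names, standard containers). It differs from that sketch in a single push direction, and it records both orders shown in the trace above:

#include <cstddef>
#include <deque>
#include <iostream>
#include <vector>

// Stand-alone DFS sketch (illustration only).  The only change from the BFS
// sketch is that unvisited adjacents are pushed at the front of the deque.
void DFS (const std::vector< std::vector<unsigned> >& adj, unsigned v)
{
  std::vector<bool>        visited (adj.size(), false);
  std::vector<std::size_t> next    (adj.size(), 0);
  std::deque<unsigned>     conQ;
  std::vector<unsigned>    discovered, finished;
  conQ.push_front(v);  visited[v] = true;  discovered.push_back(v);
  while (!conQ.empty())
  {
    unsigned f = conQ.front();
    std::size_t pos = next[f];
    while (pos < adj[f].size() && visited[adj[f][pos]]) ++pos;
    next[f] = pos;
    if (pos < adj[f].size())
    {
      unsigned n = adj[f][pos];
      conQ.push_front(n);                  // PushLIFO - the only change from BFS
      visited[n] = true;
      discovered.push_back(n);             // PreProcess(f,n)
    }
    else
    {
      conQ.pop_front();
      finished.push_back(f);               // PostProcess(f)
    }
  }
  for (unsigned x : discovered) std::cout << x << ' ';  // 5 2 6 7 9 8 4 3 0 1
  std::cout << '\n';
  for (unsigned x : finished)   std::cout << x << ' ';  // 8 9 7 6 2 1 0 3 4 5
  std::cout << '\n';
}

int main()
{
  std::vector< std::vector<unsigned> > g1 =
    { {1,3}, {0}, {5,6}, {0,4}, {3,5}, {2,4,8}, {2,7}, {6,9}, {5,9}, {7,8} };
  DFS(g1, 5);
}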
3 Survey

The information that can be extracted during one of our basic search algorithms can be very useful. On the other hand, when one has a specific goal such as discovering a path between two vertices, there may be no need for all that information. If you need to hang a picture on a wall, you don't need a professional assessment/survey of your property, only a simple tape measure. But when you want a total evaluation, a survey is called for. For graphs and digraphs, we have two such surveys available: breadth-first and depth-first. Breadth-first survey concentrates on distance and depth-first survey concentrates on time.

3.1 Breadth-First Survey

It may be past time that we succumb to the temptation to package algorithms as classes. We do that now for the graph surveys. Here is pseudo code for a BFSurvey class:

class BFSurvey
{
public:
  BFSurvey    ( const Graph& g );
  void Search ( );
  void Search ( Vertex v );
  void Reset  ( );

  Vector < int >    distance;  // distance from origin
  Vector < Vertex > parent;    // for BFS tree
  Vector < Color >  color;     // bookkeeping

private:
  const Graph&     g_;
  Vector < bool >  visited_;
  Deque < Vertex > conQ_;
};

The class contains a reference to a graph object on which the survey is performed, private data used in the algorithm control, and public variables to house three results of the survey - distance, parent, and color for each vertex in the graph. (These could be privatized with accessors and other trimmings for data security.) These data are instantiated by the survey and have the following interpretation when the survey is completed:

  code         math      english
  distance[x]  d(x)      the number of edges travelled to get from v to x
  parent[x]    p(x)      the vertex from which x was discovered
  color[x]     x.color   either black or white, depending on whether x was reachable from v

During the course of Search, when a vertex is pushed onto the control queue in FIFO order, it is colored gray and assigned distance one more than its parent at the front of the queue. The vertex is colored black when popped from the queue. At any given time during Search, the gray vertices are precisely those in the FIFO control queue.

The 1-argument constructor initializes the Graph reference and sets all the various data to the initial/reset state:

BFSurvey::BFSurvey (const Graph& g)
  :  distance(g.vSize, g.eSize + 1), parent(g.vSize, null), color(g.vSize, white),
     visited_(g.vSize, false), g_(g)
{}

The Search(v) method is essentially BFSearch(v) with the pre- and post-processing functions defined to maintain the survey data. However, the visited flags and parent information are not automatically unset at the start, so that the method can be called more than once to continue the survey in any parts of the graph that were not reachable from v.

void BFSurvey::Search( Vertex v )
{
  conQ_.Push(v);
  visited_[v] = true;
  distance[v] = 0;
  color[v] = grey;
  while (!conQ_.Empty())
  {
    f = conQ_.Front();
    if (n = unvisited adjacent from f in g_)
    {
      conQ_.PushBack(n);   // PushFIFO
      visited_[n] = true;
      distance[n] = distance[f] + 1;
      parent[n] = f;
      color[n] = grey;
    }
    else
    {
      conQ_.PopFront();
      color[f] = black;
    }
  }
}

The no-argument Search method repeatedly calls Search(v), thus ensuring that the survey considers the entire graph. Often there are relatively few vertices not reached on the first call, but nevertheless Search() perseveres until every vertex has been discovered.
void BFSurvey::Search()
{
  Reset();
  for (each vertex v of g_)
  {
    if (color[v] == white)
      Search(v);
  }
}

void BFSurvey::Reset()
{
  for (each vertex v of g_)
  {
    visited_[v] = false;
    distance[v] = g_.eSize + 1;  // impossibly large
    parent[v]   = null;
    color[v]    = white;
  }
}

We have shown in Theorem 2 of Section 2.1 that BFSearch(v) has runtime O(|V| + |E|), and this bound carries over to BFSurvey::Search(). Note that BFSurvey::Search() touches every edge and vertex in the graph. Therefore we can apply Factoid 6 in Section 1.3.3 to conclude that the bound is tight:

Theorem 4. BFSurvey::Search() has runtime Θ(|V| + |E|).

3.2 Depth-First Survey

The similarities between depth- and breadth-first search/survey algorithms are far more numerous than the differences. Yet the differences are critical to understanding and using the two. So as tedious as it may be, it is important to concentrate by finding the differences and understanding their consequences. Here is pseudo code for a DFSurvey class:

class DFSurvey
{
public:
  DFSurvey    ( const Graph& g );
  void Search ( );
  void Search ( Vertex v );
  void Reset  ( );

  Vector < unsigned > dtime;   // discovery time
  Vector < unsigned > ftime;   // finishing time
  Vector < Vertex >   parent;  // for DFS tree
  Vector < Color >    color;   // various uses

private:
  unsigned         time_;
  const Graph&     g_;
  Vector < bool >  visited_;
  Deque < Vertex > conQ_;
};

The class structure is identical to that of BFSurvey, with these exceptions:

(1) The two DFSurvey::Search algorithms use DFS [LIFO] rather than BFS [FIFO].

(2) DFSurvey maintains global time used to time-stamp discovery time and finishing time for each vertex, rather than the distance data of BFSurvey.

These time, parent, and color data are instantiated by the survey and have the following interpretation when the survey is completed:

  code       math     english
  dtime[x]   td(x)    discovery time = time x is first discovered
  ftime[x]   tf(x)    finishing time = time x is released from further investigation
  parent[x]  p(x)     the vertex from which x was discovered
  color[x]   x.color  black or white, depending on whether x was reachable from v

The 1-argument constructor initializes the Graph reference and sets all the various data to the initial/reset state. During the course of Search, vertices are colored gray at discovery time and pushed onto the control queue in LIFO order, and colored black at finishing time and popped from the queue. At any given time during Search, the gray vertices are precisely those in the LIFO control queue.

The Search(v) method is DFSearch(v) with the pre- and post-processing functions defined to maintain the survey data. Note however that the visited flags are not automatically unset at the start, so that the method can be called more than once to continue the survey in any parts of the graph that were not reachable from v. Global time is incremented immediately after each use, which ensures that no two time stamps are the same.

void DFSurvey::Search( Vertex v )
{
  dtime[v] = time_++;
  conQ_.Push(v);
  visited_[v] = true;
  color[v] = grey;
  while (!conQ_.Empty())
  {
    f = conQ_.Front();
    if (n = unvisited adjacent from f in g_)
    {
      dtime[n] = time_++;
      conQ_.PushFront(n);   // PushLIFO
      visited_[n] = true;
      parent[n] = f;
      color[n] = grey;
    }
    else
    {
      conQ_.PopFront();
      color[f] = black;
      ftime[f] = time_++;
    }
  }
}

The no-argument Search method repeatedly calls Search(v), thus ensuring that the survey considers the entire graph.
void DFSurvey::Search()
{
  Reset();
  for (each vertex v of g_)
  {
    if (color[v] == white)
      Search(v);
  }
}

void DFSurvey::Reset()
{
  for (each vertex v of g_)
  {
    visited_[v] = false;
    parent[v]   = null;
    color[v]    = white;
    dtime[v]    = 2|V|;   // impossibly large: the last time stamp used is 2|V| - 1
    ftime[v]    = 2|V|;
  }
  time_ = 0;
}

The runtime of DFSurvey::Search() succumbs to an argument identical to that for BFSurvey::Search():

Theorem 5. DFSurvey::Search() has runtime Θ(|V| + |E|).
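Hypothetical client code for the surveys might look like the sketch below. It assumes the Section 1.4 hierarchy and the survey classes above have been implemented; the class names, header, and constructor behavior are taken from the pseudo code, not from working library code. On G1, starting the depth-first survey at vertex 5 should reproduce the discovery and finishing orders of the Section 2.2 trace.

// Hypothetical client code: assumes UnGraphList and DFSurvey exist as sketched.
#include <iostream>
// #include "dfsurvey.h"

int main()
{
  UnGraphList g;                      // build G1 as in Section 1.3
  g.SetVrtxSize(10);
  unsigned e[10][2] = {{0,1},{0,3},{2,5},{2,6},{3,4},
                       {4,5},{5,8},{6,7},{7,9},{8,9}};
  for (unsigned i = 0; i < 10; ++i) g.AddEdge(e[i][0], e[i][1]);

  DFSurvey dfs(g);
  dfs.Search(5);                      // survey the part of G1 reachable from 5
  for (unsigned x = 0; x < 10; ++x)
    std::cout << x << ": dtime = "  << dfs.dtime[x]
              << "  ftime = "       << dfs.ftime[x]
              << "  parent = "      << dfs.parent[x] << '\n';
  // sorting vertices by dtime should give 5 2 6 7 9 8 4 3 0 1, and by ftime
  // 8 9 7 6 2 1 0 3 4 5, matching the trace in Section 2.2
}

Replacing DFSurvey with BFSurvey and printing distance instead of dtime/ftime would give the distances of the Section 2.1 trace.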
4 Interpreting and Applying Survey Data

This section is devoted to theory of BFS and DFS, including verification that the BFS tree is a minimal distance tree and two important analytical results on DFS time stamps. We begin by defining the search trees.

4.1 BFS and DFS Trees

Note that for either BFSurvey or DFSurvey, parent information is collected during the course of the algorithm and stored in the vector parent[]. Assume either DFS or BFS context, and suppose we have run Search() - the complete survey. If parent[x] ≠ null, define p(x) = parent[x] = the vertex from which x is discovered, and:

  T(s) = { (p(x), x) | x ≠ s and x is reachable from s }
  V(s) = { x | x is reachable from s }
  F    = { (p(x), x) | parent[x] ≠ null }

Lemma 1 (Tree Lemma). After XFSurvey::Search(s), (V(s), T(s)) is a tree with root s. If G is a directed graph, the edges of T(s) are directed away from s.

Proof. First note that s is the unique vertex in T(s) with null parent pointer, by inspection of the algorithm. Also note that (p(x), x) is an edge (directed from p(x) to x) in the graph, again by inspection of the algorithm. Following the parent pointers until a null pointer parent[s] = null is reached defines a (directed) path in T(s) from s to x. Now count the vertices and edges. For each vertex x other than s, (p(x), x) is an edge distinct from any other (p(y), y) because x ≠ y. Therefore the edges are in 1-1 correspondence with the vertices other than s. We have a connected graph with |V(s)| = |E(s)| + 1, so it must be a tree.

We call (V(s), T(s)) the search tree generated by the survey starting at s.

Lemma 2 (Forest Lemma). After XFSurvey::Search(), (V, F) is a forest whose trees are all of the search trees generated during the search: F = ∪s T(s), where the union is over all vertices s with parent[s] = null.

Proof. By the Tree Lemma, T(s) is a tree for each starting vertex s. Suppose some edge in F connects two of these trees, say T(s1) and T(s2). The edge must necessarily be of the form (p(x), x) for some x, where x ∈ T(s1) and p(x) ∈ T(s2). But then the parent-path from x will pass through p(x) to s2, which means that s2 should have been discovered by XFSurvey::Search(s1), a contradiction.

We call (V, F) the search forest generated by the survey.

4.2 Interpreting BFSurvey

We have alluded to the shortest path property of BFS in previous sections. It is time to come to full contact with a proof, and we devote most of the rest of this section to doing that. We follow the proof in [Cormen et al 3e].

For any pair x, y of vertices in G, define the shortest-path-distance from x to y to be δ(x, y) = the length of the shortest path from x to y in G, or δ(x, y) = 1 + |E| if y is not reachable from x.

Lemma 3. Let G = (V, E) be a directed or undirected graph and x, y, z ∈ V. If y is reachable from x and z is reachable from y then δ(x, z) ≤ δ(x, y) + δ(y, z).

Proof. First note that z is reachable from x by concatenating shortest paths from x to y and y to z. This path from x through y to z has length δ(x, y) + δ(y, z). The shortest path from x to z can be no longer than this path. Therefore δ(x, z) ≤ δ(x, y) + δ(y, z).

Note in passing that δ(y, z) = 1 if (y, z) ∈ E.

Assumptions. For the remainder of this section, let G = (V, E) be a directed or undirected graph and suppose BFSurvey::Search(s) has been called for some starting vertex s ∈ V. Let d(x) = distance[x] for each x ∈ V.

Lemma 4. The path in T(s) from s to x has length d(x).

Proof. Use mathematical induction on d(x).

Corollary. For each vertex x, d(x) ≥ δ(s, x).

Lemma 5. If x and y are both gray vertices (i.e., in the control queue) with x colored gray before y (i.e., x pushed before y), then d(x) ≤ d(y) ≤ d(x) + 1.

Lemma 6. If x and y are reachable from s and x is discovered before y, then d(x) ≤ d(y).

Proof. Examine the code to see that when x is pushed onto the control queue, d(x) = d(p(x)) + 1 (and at that time p(x) is the front of the queue). Show by mathematical induction that d values are non-decreasing for all vertices in the queue. Because d values are never changed once a vertex is pushed, if x is pushed before y then d(x) ≤ d(y).

Lemma 7. d(x) = δ(s, x) for all reachable x.

Proof. Suppose that the result fails. Let δ be the smallest shortest-path-distance for which the result fails, and let y be a vertex for which d(y) > δ(s, y) = δ. Let x be the next-to-last vertex on a shortest path from s to y. Then δ(s, y) = 1 + δ(s, x) and, because of the minimality of δ = δ(s, y), d(x) = δ(s, x). We summarize what we know so far:

  d(y) > δ(s, y) = 1 + δ(s, x) = 1 + d(x)

Now consider the three possible colors of y at the times x is at the front of the control queue. If y is white, then y will be pushed onto conQ while x is at the front, making d(y) = d(x) + 1, a contradiction. If y is black, it has been popped and d(y) ≤ d(x) by Lemma 6, again a contradiction. If y is gray, then d(y) ≤ d(x) + 1 by Lemma 5, a contradiction yet again. Therefore under all possibilities our original assumption is false.

Putting these facts together we have:

Theorem 6 (Breadth-First Tree Theorem). Suppose BFSurvey::Search(s) has been called for the graph or digraph G = (V, E). Then (1) For each reachable vertex x ∈ V, d(x) is the shortest-path-distance from s to x; and (2) The breadth-first tree contains a shortest path from s to x.
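Theorem 6 is what makes the parent data useful in practice: following parent pointers from x back to s and reversing gives a shortest path. Here is a short sketch, assuming the survey results are available as plain vectors and that a sentinel value null marks "no parent" as in BFSurvey::Reset (names are illustrative, not library code):

#include <algorithm>
#include <vector>

// Sketch: recover a shortest path s -> x from the BFS parent vector (Theorem 6).
std::vector<unsigned> ShortestPath (const std::vector<unsigned>& parent,
                                    unsigned s, unsigned x, unsigned null)
{
  std::vector<unsigned> path;
  unsigned v = x;
  while (v != s)
  {
    path.push_back(v);
    if (parent[v] == null) return std::vector<unsigned>();  // x not reachable from s
    v = parent[v];
  }
  path.push_back(s);
  std::reverse(path.begin(), path.end());  // path now runs s ... x, with d(x) edges
  return path;
}

For G1 surveyed from s = 5, ShortestPath(parent, 5, 1, null) would return 5 4 3 0 1, matching d(1) = 4 in the Section 2.1 trace.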
4.3 Interpreting DFSurvey

The time stamps on vertices during a DFSurvey::Search() provide a way to codify the effects of LIFO order in the control system for DFS. We have already observed that time stamps are unique, and one stamp is used for each change of color of a vertex. Let us define td(x) = dtime[x] and tf(x) = ftime[x] for each vertex x. Inspection of the algorithm shows that discovery occurs before finishing:

Lemma 8. For each vertex x, td(x) < tf(x).

Therefore the interval [td(x), tf(x)] represents the time values for which x is in the control LIFO, that is, the times when x has color gray. Prior to td(x), x is white, and after tf(x), x is black.

Theorem 7 (Parenthesis Theorem). Assume G = (V, E) is a (directed or undirected) graph and that DFSurvey::Search() has been run on G. Then for two vertices x and y, exactly one of the following three conditions holds:

(1) The time intervals [td(x), tf(x)] and [td(y), tf(y)] are disjoint, and x and y belong to different trees in the DFS forest.

(2) [td(x), tf(x)] is a subset of [td(y), tf(y)], and x is a descendant of y in the forest.

(3) [td(x), tf(x)] is a superset of [td(y), tf(y)], and x is an ancestor of y in the forest.

Proof. First suppose x and y belong to different trees in the forest. Then x is discovered during one call Search(v) and y is discovered during a different call Search(w) where v ≠ w. Then x is colored gray and then black during Search(v), and y is colored gray and then black during Search(w). Clearly these two processes do not overlap in time, and condition (1) holds.

Suppose on the other hand that x and y are in the same tree in the search forest. Without loss of generality we assume y is a descendant of x. Then, by inspection of the algorithm, x must be colored gray before y. Hence, td(x) < td(y). But due to the LIFO order of processing, this means that y is colored black before x. Therefore tf(y) < tf(x). That is, [td(y), tf(y)] is a subset of [td(x), tf(x)], and condition (3) holds. A symmetric argument completes the proof.

Theorem 8 (White Path Theorem). In a depth-first forest of a directed or undirected graph G = (V, E), vertex y is a descendant of vertex x iff at the discovery time td(x) there is a path from x to y consisting entirely of white vertices.

Proof. First note that discovery times td(x) = dtime[x] are stamped prior to any processing of x in the DFSurvey::Search algorithm.

First suppose z is a descendant of x. If z = x then {x} is a white path. If z ≠ x then td(x) < td(z) by the Parenthesis Theorem, so z is white at time td(x). Applying the observation to any z in the DFS tree path from x to y shows that the DFS tree path from x to y consists of white vertices.

Conversely, suppose at time td(x) there is a path from x to y consisting entirely of white vertices. If some vertex in this path is not a descendant of x, let v be the one closest to x with this property. Then the predecessor u on the path is a descendant of x. At time td(u), v is white and an unvisited adjacent of u, so v will be discovered and p(v) = u. That is, v is a descendant of u, and hence of x, contradicting the assumption that v is not a descendant of x. Therefore every vertex on the white path is a descendant of x.

4.4 Classification of Edges

The surveys can be used to classify edges of a graph or directed graph. We will use DFSurvey for this purpose. Given an edge, there are four possibilities: (1) it is in the DFS Forest; it goes from x to another vertex in the same tree, either (2) an ancestor or (3) a descendant; or (4) it goes to a vertex that is neither ancestor nor descendant, whether in the same or a different tree.

1. Tree edges are edges in the depth-first forest.

2. Back edges are edges (x, y) connecting a vertex x in the DFS forest to an ancestor y in the same tree of the forest.

3. Forward edges are edges (x, y) connecting a vertex x in the DFS forest to a descendant y in the same tree of the forest.

4. Cross edges are any other edges. These might go to another vertex in the same tree or a vertex in a different tree.

For an undirected graph, this classification is based on the first encounter of the edge in the DFSurvey.

Note these observations relating the color of the terminal vertex of an edge to the edge classification. Suppose e = (x, y) is an edge of G, and consider the moment in (algorithm) time when e is explored. Then:

1. If y is white then e is a tree edge.
2. If y is gray then e is a back edge.
3. If y is black then e is a forward or cross edge.

Theorem 9. In a depth-first survey of an undirected graph G, every edge is either a tree edge or a back edge.

Proof. Let e = (x, y) be an edge of G. Since G is undirected, e = (y, x) as well, so we can assume that x is discovered before y. At time td(x), y is white. Suppose e is first explored from x. Then y is white at the time, and hence e becomes a tree edge. If e is first explored from y, then x is gray at the time, and e is a back edge.
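For directed graphs, the classification can also be read off the completed survey data, as a consequence of the Parenthesis Theorem. The sketch below is illustrative, not part of the course library: it assumes the DFSurvey output is available as plain vectors, that parent[y] == x identifies tree edges (parallel edges ignored), and it classifies any directed edge (x, y):

#include <string>
#include <vector>

// Sketch: classify a directed edge (x,y) after DFSurvey::Search() has run,
// using only dtime, ftime, and parent (names follow Section 3.2).
std::string ClassifyEdge (unsigned x, unsigned y,
                          const std::vector<unsigned>& dtime,
                          const std::vector<unsigned>& ftime,
                          const std::vector<unsigned>& parent)
{
  if (parent[y] == x)                                return "tree";
  if (dtime[y] <= dtime[x] && ftime[x] <= ftime[y])  return "back";     // y is an ancestor of x
  if (dtime[x] <  dtime[y] && ftime[y] <  ftime[x])  return "forward";  // y is a proper descendant of x
  return "cross";                                    // time intervals disjoint
}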
Theorem 10. A directed graph D contains no directed cycles iff a depth-first search of D yields no back edges.

Proof. If DFS produces a back edge (x, y), adding that edge to the DFS tree path from y to x creates a cycle.

If D has a (directed) cycle C, let y be the first vertex discovered in C, and let (x, y) be the preceding edge in C. At time td(y), the vertices of C form a white path from y to x. By the white path theorem, x is a descendant of y, so (x, y) is a back edge.

5 Spinoffs from BFS and DFS

If theorems have corollaries, do algorithms have cororithms? Maybe, but that is difficult to speak. "Spinoff" is a very informal term meaning an extra outcome or simple modification of the algorithm that requires little or no extra verification or analysis.

5.1 Components of a Graph

Suppose G = (V, E) is an undirected graph. G is called connected iff for every pair x, y ∈ V of vertices there is a path in G from x to y. A component of G is a graph C such that (1) C is a subgraph of G, (2) C is connected, and (3) C is maximal with respect to (1) and (2).

The technology developed in Section 4.3 shows that the following instantiation of the DFS algorithm produces a Vector<unsigned> component such that component[x] is the (number of the) component containing x, for each vertex x of G. All that is needed is to declare the component vector, define PostProcess, and make a small adjustment to DFSurvey::Search():

void DFSurvey::Search()
{
  components = 0;       // counter shared with PostProcess
  for (each vertex v of g_)
    if (color[v] == white)
    {
      components += 1;
      Search(v);
    }
}

PostProcess(f)
{
  component[f] = components;
}

Recall that we know the DFS forest is a collection of trees, each tree generated by a call to Search(v). The DFS trees are in 1-1 correspondence to the components of G. The algorithm above counts the components and assigns each vertex its component number as it is processed.

This is an algorithm that runs in time Θ(|V| + |E|) and computes a vector that, subsequently, looks up the component of any vertex in constant time. The algorithm doesn't need vertex color (equate color white with unvisited), only the minimum control structure variables.

5.2 Topological Sort

A directed graph is acyclic if it has no (directed) cycles. A directed acyclic graph is called a DAG for short. DAGs occur naturally in various places, such as:

  vertices                 directed edge
  cells in a spreadsheet   dependency on another cell
  targets in a makefile    dependency on another target
  courses in a curriculum  course prerequisite

In these and other models, it is important to know what order to consider the vertices. For example, courses need to be taken respecting the prerequisite structure, make needs to build the targets in dependency order, and a spreadsheet cell should be calculated only after the cells on which it depends have been calculated.

A topological sort of a DAG is an ordering of its vertices in such a way that all edges go from lower to higher vertices in the ordering. DFS can be used to extract a topological sort from a DAG, as follows:

TopSort
Modifies DFSurvey
Uses double-ended queue outQ

Run DFSurvey(D)
As a vertex x is finished, outQ.PushFront(x)   // LIFO - reverses order
Then outQ [front to back] is a topological sort of D

Theorem 10 does the heavy lifting to show the correctness of TopSort.
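Here is a stand-alone sketch of TopSort (illustration only). It uses recursive DFS in place of the explicit control stack of DFSurvey, and pushes each vertex on the front of outQ as it finishes, exactly as described above; the example DAG is invented for demonstration:

#include <deque>
#include <iostream>
#include <vector>

// Visit: recursive DFS that records vertices in reverse finishing order.
void Visit (const std::vector< std::vector<unsigned> >& adj, unsigned x,
            std::vector<bool>& visited, std::deque<unsigned>& outQ)
{
  visited[x] = true;
  for (unsigned y : adj[x])
    if (!visited[y]) Visit(adj, y, visited, outQ);
  outQ.push_front(x);               // finished: LIFO push reverses finishing order
}

std::deque<unsigned> TopSort (const std::vector< std::vector<unsigned> >& adj)
{
  std::vector<bool> visited (adj.size(), false);
  std::deque<unsigned> outQ;
  for (unsigned x = 0; x < adj.size(); ++x)
    if (!visited[x]) Visit(adj, x, visited, outQ);
  return outQ;                      // front to back: a topological sort (if adj is a DAG)
}

int main()
{
  // A small DAG: 0->1, 0->2, 1->3, 2->3, 3->4
  std::vector< std::vector<unsigned> > dag = { {1,2}, {3}, {3}, {4}, {} };
  for (unsigned x : TopSort(dag)) std::cout << x << ' ';   // prints: 0 2 1 3 4
  std::cout << '\n';
}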
Exercises

1. Conversions between directed and undirected graphs.

   (a) Consider an undirected graph G = (V, E) represented by either an adjacency matrix or an adjacency list. What changes to the representations are made when G is converted to a directed graph? Explain.

   (b) Consider a directed graph D = (V, E) represented by either an adjacency matrix or an adjacency list. What changes to the representations are made when D is converted to an undirected graph? Explain.

2. Find the appropriate places in the Graph hierarchy to re-define each of the virtual methods named in the Graph base class, and provide the implementations.

3. The way vertices are stored in adjacency lists has an arbitrary effect on the order in which they are processed by DFS and BFS.

   (a) Explain these effects.

   (b) How might the graph edge insertion operations be modified to enforce encountering vertices in numerical order?

4. Prove: During BFSurvey::Search() on a graph G, if x and y are both gray vertices with x colored gray before y, then d(x) ≤ d(y) ≤ d(x) + 1. (This is Lemma 5 above.)

5. Consider an alternative topological sort algorithm offered first by Donald Knuth:

   TopSort2
   Operates on: Directed Graph D = (V,E)
   Uses: vector <unsigned> id indexed on vertices
         FIFO queue conQ
         double-ended queue outQ

   for each vertex x, id[x] = inDegree(x);
   for each vertex x, if id[x] == 0, conQ.Push(x);
   while (!conQ.Empty())
   {
     f = conQ.Front(); conQ.Pop();
     for every neighbor n of f
     {
       --id[n];
       if (id[n] == 0) conQ.Push(n);   // front or back?
     }
     outQ.Push(f);                     // front or back?
   }
   if (outQ.Size() == |V|)
     export outQ as a topological sort of D (front to back)
   else
     export "D has a cycle"

   (a) Show that TopSort2 produces a topological sort iff D is acyclic.

   (b) The two push operations were not specified as FIFO (push back) or LIFO (push front). Which choices will ensure that TopSort2 produces the same topological sort as TopSort?

   (c) Use aggregate analysis to derive and verify the worst case runtime for TopSort2.

Software Engineering Projects

6. Develop the graph class hierarchy as outlined in Section 1.4 above. Be sure to provide adjacency iterators facilitating BFS and DFS implementations.

7. Implement BFSurvey and DFSurvey operating on graphs and digraphs via the API provided by the hierarchy above.

8. Develop two classes BFSIterator and DFSIterator that may be used by UnGraphList and DiGraphList. The goal is that these traversal loops are defined for the graph/digraph g:

   for (BFSIterator i.Initialize(g,v); !i.Finished(); ++i)
   {
     std::cout << *i;
   }

   for (DFSIterator i.Initialize(g,v); !i.Finished(); ++i)
   {
     std::cout << *i;
   }

   and accomplish BFSurvey::Search(v) and DFSurvey::Search(v), respectively, and output the vertex number in discovery order. Of course, the traversals defined with iterators may be stopped, or paused and restarted, in the client program. The iterators should provide access to all of the public survey information.