CSCI 256 Data Structures and Algorithm Analysis
Lecture 9
Some slides by Kevin Wayne, copyright 2005 Pearson Addison Wesley, all rights reserved, and some by Iker Gondra

Shortest Path Problem
• Negative Cost Edges
  – Dijkstra’s algorithm assumes positive cost edges
  – For some applications, negative cost edges make sense
  – Shortest path not well defined if a graph has a negative cost cycle
  – Bellman-Ford algorithm finds shortest paths in a graph with negative cost edges (or reports the existence of a negative cost cycle)
[Figure: example graph with positive and negative edge costs]

Minimum Spanning Tree
• Minimum spanning tree: Given a connected graph G = (V, E) with real-valued edge weights ce, an MST is a subset of the edges T ⊆ E such that T is a spanning tree (a tree which spans G) whose sum of edge weights Σe∈T ce is minimized
[Figure: a graph G = (V, E) and an MST T with Σe∈T ce = 50]

Applications
• MST is a fundamental problem with diverse applications
  – Network design
    • telephone, electrical, hydraulic, TV cable, computer, road
  – Approximation algorithms for NP-hard problems
    • traveling salesperson problem, Steiner tree
  – Indirect applications
    • max bottleneck paths
    • LDPC codes for error correction
    • image registration with Renyi entropy
    • learning salient features for real-time face verification
    • reducing data storage in sequencing amino acids in a protein
    • model locality of particle interactions in turbulent fluid flows
    • autoconfig protocol for Ethernet bridging to avoid cycles in a network
  – Cluster analysis

Greedy Algorithms
• Kruskal's algorithm
  – Start with T = φ. Consider edges in ascending order of cost. Insert edge e into T unless doing so would create a cycle
• Reverse-Delete algorithm
  – Start with T = E. Consider edges in descending order of cost. Delete edge e from T unless doing so would disconnect T
• Prim's algorithm
  – Start with some root node s and greedily grow a tree T from s outward.
At each step, add the cheapest edge e to T that has exactly one endpoint in T
• Remark: All three algorithms produce an MST

Greedy Algorithm 1: Kruskal’s Algorithm
• Add the cheapest edge that joins disjoint components
• Construct the MST with Kruskal’s algorithm; label the edges in order of insertion
[Figure: example weighted graph on vertices s, t, a, b, c, e, u, v, g, f]

Greedy Algorithm 2: Reverse-Delete Algorithm
• Delete the most expensive edge that does not disconnect the graph
• Construct the MST with the reverse-delete algorithm; label the edges in order of removal
[Figure: the same example weighted graph]

Greedy Algorithm 3: Prim’s Algorithm
• Extend a tree by including the cheapest outgoing edge
• Construct the MST with Prim’s algorithm starting from vertex a; label the edges in order of insertion
[Figure: the same example weighted graph]

Why do the greedy algorithms work?
• All these algorithms work by repeatedly inserting or deleting edges from a partial solution
  – Thus to analyze these algorithms, it would be useful to have in hand some basic facts saying when it is “safe” to include an edge in the MST, or when it is “safe” to eliminate an edge on the grounds that it couldn’t possibly be in the MST
• For simplicity, assume all edge costs are distinct. Thus, we can refer to “the MST”

When is it safe to include an edge in the MST?
• Edge inclusion lemma (also called the “Cut Property”): Let S be a subset of V, and suppose e = (v, w) is the minimum cost edge of E with v in S and w in V-S. Then e is in every MST T of G.
[Figure: the cut (S, V-S) crossed by edge e]

Proof: (we show the contrapositive)
• Suppose T is a spanning tree that does not contain e.
We need to show that T does not have the minimum possible cost
• We do this using an exchange argument – we will identify an edge e1 in T that is more expensive than e, with the property that exchanging e for e1 results in a spanning tree that is cheaper than T
• The crux is to find this e1

Proof (continued)
• Edge e is incident to v (in S) and w (in V-S); T is a spanning tree, so there is a path P in T from v to w. Starting at v, follow the nodes of P in sequence until we reach the first node w’ in V-S. Let v’ be the node just before w’ in P, and let e1 be (v’, w’).
• Consider: T’ = T – {e1} + {e}
• We can show that:
  – T’ is a spanning tree (show it is connected and acyclic)
  – T’ has lower cost

Proof (continued)
• Easy to see that T’ is connected
• The only cycle in T’ + {e1} must be composed of e and the path P, so if we remove e1 we have an acyclic subgraph
• e is the minimum cost edge between S and V-S
[Figure: the cut (S, V-S) crossed by the cheaper edge e and the more expensive edge e1]
• T’ = T – {e1} + {e} is a spanning tree with lower cost than T (as we have exchanged the more expensive e1 for the cheaper e)
• Hence, T is not a minimum spanning tree

Optimality Proofs
• Prim’s Algorithm computes an MST
• Kruskal’s Algorithm computes an MST
• Idea of both proofs: Show that when an edge is added to the MST by Prim or Kruskal, the edge is the minimum cost edge between S and V-S for some set S of nodes (which grows with each edge addition until it equals V)

Prim’s Algorithm (grow a tree, T)
  S = { s }; T = { }
  while S ≠ V
    choose the minimum cost edge e = (u, v) with u in S and v in V-S
    add e to T
    add v to S

Prove Prim’s algorithm computes an MST
(1) The algorithm only adds edges belonging to every MST.
  – On each iteration there is a set S, a subset of V on which a partial spanning tree has been constructed, and a node v and edge e are added so as to minimize ce over edges e = (u, v) with u in S and v in V-S.
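As a concrete illustration, the Prim pseudocode above can be sketched in Python using a binary heap of candidate edges (the adjacency-list format and function names here are my own, not from the slides):

```python
import heapq

def prim_mst(adj, s=0):
    """Grow a tree T from root s; at each step add the cheapest edge
    with exactly one endpoint in the current tree (the set S).
    adj[u] is a list of (cost, v) pairs; the graph is assumed connected."""
    n = len(adj)
    in_S = [False] * n
    in_S[s] = True
    heap = [(c, s, v) for c, v in adj[s]]   # candidate edges leaving S
    heapq.heapify(heap)
    tree, total = [], 0
    while heap and len(tree) < n - 1:
        c, u, v = heapq.heappop(heap)
        if in_S[v]:                # both endpoints now in S: stale edge, skip
            continue
        in_S[v] = True             # add v to S and e = (u, v) to T
        tree.append((u, v, c))
        total += c
        for c2, w in adj[v]:
            if not in_S[w]:
                heapq.heappush(heap, (c2, v, w))
    return total, tree
```

With this lazy-deletion heap, each edge is pushed and popped at most once in each direction, giving O(m log m) time on a graph with m edges.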
By definition e is the cheapest edge with one end in S and the other in V-S, so by the Cut Property it is in every minimum spanning tree of G.
(2) The algorithm produces a spanning tree – Clear

Kruskal’s Algorithm (grow bigger connected sets with the minimum cost edge available)
  Let C = { C1 = {v1}, C2 = {v2}, . . ., Cn = {vn} }; T = { }
  while |C| > 1
    Let e = (u, v) with u in Ci and v in Cj be the minimum cost edge joining (the disjoint and disconnected) sets in C
    Replace Ci and Cj by their union C’i
    Add e to T

Prove Kruskal’s algorithm computes an MST
(1) An edge e is in the MST when it is added to T.
  – The sets we begin with are disjoint, and as we find edges between any two we redefine the sets so they remain disjoint from each other, so this follows by the “Cut Property”
(2) The process continues until there is only one connected set containing all the vertices – so the set spans G

When can we guarantee an edge is not in the MST?
• Cycle Property
  – The most expensive edge on a cycle is never in an MST
  [Figure: e is the most expensive edge on a cycle crossing between S and V-S; e1 is a cheaper crossing edge]
  – Optimality of the Reverse-Delete algorithm follows from this

Proof of the Cycle Property (also uses an exchange argument!)
Proof: Suppose C is a cycle and e = (v, w) is its most expensive edge. We proceed by contradiction:
• Assume e is in an MST T of G.
• If we delete e, we partition the nodes of T into two sets, S and V-S, with v in S and w in V-S.
• Since we began with a cycle, there must be another edge e’ of C with one end in S and one end in V-S. e was the most expensive edge, so e’ is cheaper. We exchange e for e’, resulting in T’ = T – {e} + {e’}.
• T’ spans G and its cost is less than that of T.
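The Kruskal pseudocode above maintains a collection C of disjoint sets; in practice this is usually done with a union-find structure. A minimal Python sketch (function names are illustrative, not from the slides):

```python
def kruskal_mst(n, edges):
    """edges: list of (cost, u, v) triples over vertices 0..n-1.
    Union-find plays the role of the component sets C1, ..., Cn."""
    parent = list(range(n))

    def find(x):                        # representative of x's component
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    tree, total = [], 0
    for c, u, v in sorted(edges):       # ascending order of cost
        ru, rv = find(u), find(v)
        if ru != rv:                    # endpoints in different components,
            parent[ru] = rv             # so adding e creates no cycle: merge
            tree.append((u, v, c))
            total += c
    return total, tree
```

Sorting dominates the running time, giving O(m log m) on a graph with m edges; with union by rank and path compression the union-find work is nearly linear.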
• This contradicts the fact that T was an MST of G

Dealing with the assumption of no equal weight edges
• Force the edge weights to be distinct
  – Add small quantities to the weights
  – Give a tie-breaking rule for equal weight edges

Clustering
• Clustering: Given a set U of n objects labeled p1, …, pn, classify them into coherent groups, e.g., photos, documents, micro-organisms
• Distance function: Numeric value specifying “closeness” of two objects, e.g., number of corresponding pixels whose intensities differ by some threshold
• Fundamental problem: Divide into clusters so that points in different clusters are far apart
  – Identify patterns in gene expression
  – Document categorization for web search
  – Similarity searching in medical image databases

Clustering of Maximum Spacing
• Distance function: Assume it satisfies several natural properties
  – d(pi, pj) = 0 iff pi = pj (identity of indiscernibles)
  – d(pi, pj) ≥ 0 (nonnegativity)
  – d(pi, pj) = d(pj, pi) (symmetry)
• Spacing: Min distance between any pair of points in different clusters
• Clustering of maximum spacing: Given integer k, find a k-clustering of maximum spacing
[Figure: a point set with k = 4 and its spacing marked; the same points divided into 2, 3, and 4 clusters]

Greedy Clustering Algorithm
• Distance clustering algorithm
  – Form a graph on the vertex set U as follows (the connected components are the clusters – without any edges you would have n clusters)
  – First draw an edge between the closest pair of points, then draw an edge between the next closest pair of points, and keep adding edges between pairs of points of increasing d(pi, pj).
The connected components correspond to clusters; there is no need to add an edge between any pair of points in the same cluster (thus avoiding cycles)
  – Repeat until there are exactly k clusters
• Key observation: This procedure is precisely Kruskal’s algorithm (except we stop when there are k connected components)
• Remark: Equivalent to finding an MST and deleting the k-1 most expensive edges (if we take away k-1 edges from a spanning tree we are left with k connected components)

Distance Clustering Algorithm – like Kruskal’s Algorithm
  Let C = { {v1}, {v2}, . . ., {vn} }; T = { }
  while |C| > k
    Let e = (u, v) with u in Ci and v in Cj be the minimum cost edge joining disjoint sets in C
    Replace Ci and Cj by C’i = Ci ∪ Cj

More Greedy Algorithms: Coin Changing vs Stamp Buying
• Goal: Given currency denominations 1, 5, 10, 25, 100, devise a method to pay an amount to the customer using the fewest number of coins
  – Ex: 34¢
• Cashier’s algorithm: At each iteration, add the coin of the largest value that does not take us past the amount to be paid
  – Ex: $2.89
• Theorem: Greedy is optimal for U.S. coinage: 1, 5, 10, 25, 100
• Question: Is the greedy algorithm optimal for US postal denominations: 1, 10, 21, 34, 70, 100, 350, 1225, 1500?
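A quick experiment answers the closing question (the helper function here is my own, not from the slides). The cashier's algorithm matches the optimum for U.S. coins on 34¢, but it is not optimal for the postal denominations: on 140¢ greedy uses 8 stamps while 70 + 70 uses only 2.

```python
def greedy_count(amount, denoms):
    """Cashier's algorithm: repeatedly take as many of the largest
    denomination as fit without overshooting the amount."""
    count = 0
    for d in sorted(denoms, reverse=True):
        count += amount // d
        amount %= d
    return count

coins = [1, 5, 10, 25, 100]
stamps = [1, 10, 21, 34, 70, 100, 350, 1225, 1500]

# 34 cents in U.S. coins: greedy gives 25 + 5 + 1 + 1 + 1 + 1 = 6 coins (optimal)
six = greedy_count(34, coins)

# 140 cents in stamps: greedy gives 100 + 34 + 6x1 = 8 stamps,
# but 70 + 70 needs only 2 -- so greedy is NOT optimal here
eight = greedy_count(140, stamps)
```

This is the standard caution about greedy methods: correctness depends on the structure of the denominations, not on the algorithm alone.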