CSCI 256
Data Structures and Algorithm Analysis
Lecture 9
Some slides by Kevin Wayne copyright 2005, Pearson Addison Wesley
all rights reserved, and some by Iker Gondra
Shortest Path Problem
• Negative Cost Edges
– Dijkstra’s algorithm assumes positive cost edges
– For some applications, negative cost edges make sense
– Shortest path not well defined if a graph has a negative cost
cycle
– Bellman-Ford algorithm finds shortest paths in a graph with
negative cost edges (or reports the existence of a negative cost
cycle).
[Figure: directed graph with source s and vertices a, b, c, e, f, g; edge costs include 6, 4, 3, 2, 7 and negative costs -4, -3, -2]
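The Bellman-Ford idea above can be sketched in a few lines of Python (a minimal sketch, assuming the graph is given as an edge list; the function and variable names are illustrative, not from the slides):

```python
def bellman_ford(n, edges, s):
    """Shortest path costs from s in a graph with n vertices given as an
    edge list [(u, v, cost), ...]; costs may be negative.
    Returns the distance list, or None if a negative cost cycle
    is reachable from s."""
    INF = float('inf')
    dist = [INF] * n
    dist[s] = 0
    # Relax every edge n-1 times; any shortest path uses at most
    # n-1 edges, so after these passes all distances are final.
    for _ in range(n - 1):
        for u, v, cost in edges:
            if dist[u] + cost < dist[v]:
                dist[v] = dist[u] + cost
    # One more pass: a further improvement means a negative cost cycle.
    for u, v, cost in edges:
        if dist[u] + cost < dist[v]:
            return None
    return dist
```

Note how a negative edge such as (1, 2, -3) can shorten a path that a Dijkstra-style greedy choice would have committed to too early.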
Minimum Spanning Tree
• Minimum spanning tree: Given a connected graph G =
(V, E) with real-valued edge weights ce, a MST is a
subset of the edges T ⊆ E such that T is a spanning tree
(tree which spans G) whose sum of edge weights is
minimized.
[Figure: weighted graph G = (V, E) with edge costs 4–24, and an MST T with Σe∈T ce = 50]
Applications
• MST is a fundamental problem with diverse applications
– Network design
• telephone, electrical, hydraulic, TV cable, computer, road
– Approximation algorithms for NP-hard problems
• traveling salesperson problem, Steiner tree
– Indirect applications
• max bottleneck paths
• LDPC codes for error correction
• image registration with Renyi entropy
• learning salient features for real-time face verification
• reducing data storage in sequencing amino acids in a protein
• model locality of particle interactions in turbulent fluid flows
• autoconfig protocol for Ethernet bridging to avoid cycles in a
network
– Cluster analysis
Greedy Algorithms
• Kruskal's algorithm
– Start with T = ∅. Consider edges in ascending order of cost.
Insert edge e into T unless doing so would create a cycle
• Reverse-Delete algorithm
– Start with T = E. Consider edges in descending order of cost.
Delete edge e from T unless doing so would disconnect T
• Prim's algorithm
– Start with some root node s and greedily grow a tree T from s
outward. At each step, add the cheapest edge e to T that has
exactly one endpoint in T
• Remark: All three algorithms produce a MST
Greedy Algorithm 1:
Kruskal’s Algorithm
• Add the cheapest edge that joins disjoint
components
[Figure: exercise graph with vertices s, t, u, v, a, b, c, e, f, g and edge costs 1–22. Construct the MST with Kruskal's algorithm; label the edges in order of insertion.]
Greedy Algorithm 2:
Reverse-Delete Algorithm
• Delete the most expensive edge that does not
disconnect the graph
[Figure: the same exercise graph. Construct the MST with the reverse-delete algorithm; label the edges in order of removal.]
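Reverse-Delete is the one algorithm of the three that gets no pseudocode later in the lecture, so here is a hedged Python sketch (the edge-list format and all names are assumptions; the connectivity test is a plain BFS, chosen for clarity rather than speed):

```python
def reverse_delete(n, edges):
    """MST by Reverse-Delete: start with all edges, scan them in
    descending order of cost, and drop an edge whenever the graph
    stays connected without it.  edges: [(cost, u, v), ...]."""
    def connected(edge_list):
        # BFS from vertex 0 over edge_list; True if all n vertices reached
        adj = {i: [] for i in range(n)}
        for _, u, v in edge_list:
            adj[u].append(v)
            adj[v].append(u)
        seen, stack = {0}, [0]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        return len(seen) == n

    order = sorted(edges, reverse=True)   # most expensive first
    kept = list(order)
    for e in order:
        trial = [f for f in kept if f != e]
        if connected(trial):
            kept = trial                  # safe to delete e
    return kept
```

By the Cycle Property discussed later, every edge this procedure deletes is the most expensive edge on some cycle, which is why it ends with an MST.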
Greedy Algorithm 3:
Prim’s Algorithm
• Extend a tree by including the cheapest outgoing edge
[Figure: the same exercise graph. Construct the MST with Prim's algorithm starting from vertex a; label the edges in order of insertion.]
Why do the greedy algorithms work?
• All these algorithms work by repeatedly inserting
or deleting edges from a partial solution
– Thus to analyze these algorithms, it would be useful
to have in hand some basic facts saying when it is
“safe” to include an edge in the MST or when it is
“safe” to eliminate an edge on the grounds that it
couldn’t possibly be in the MST
• For simplicity, assume all edge costs are
distinct. Thus, we can refer to “the MST”
When is it safe to include an edge in the
MST?
• Edge inclusion lemma (also called the “Cut
property”)
Let S be a subset of V, and suppose e = (u, v) is the
minimum cost edge of E, with u in S and v in V-S. Then
e is in every MST T of G.
[Figure: cut (S, V-S) with edge e crossing between S and V-S]
Proof: (we show the contrapositive)
• Suppose T is a spanning tree that does not
contain e. We need to show that T does not
have the minimum possible cost
• We do this using an exchange argument – we
will identify an edge e1 in T that is more
expensive than e, such that exchanging e1 for e
results in a spanning tree that is cheaper than T
• The crux is to find this e1
Proof: (we show the contrapositive)
• Edge e = (u, v) has u in S and v in V-S; T is
a spanning tree, so there is a path P in T from u
to v. Starting at u, follow the nodes of P in sequence
until we get the first node w’ in V-S. Let v’ be the
node just before w’ in P and let e1 be (v’,w’).
• Consider: T’ = T – {e1} + {e}
• We can show that:
– T’ is a spanning tree (show it is connected and
acyclic)
– T’ has lower cost
Proof (we show the contrapositive)
• Easy to see that T’ is connected
• The only cycle in T + {e} must be composed of e and the
path P, so removing e1 from it leaves an acyclic subgraph
• e is the minimum cost edge between S and V-S
[Figure: cut (S, V-S) with crossing edges e and e1; e is the minimum cost edge between S and V-S]
• T’ = T – {e1} + {e} is a spanning tree with lower cost than
T (as we have exchanged the more expensive e1 for the
cheaper e)
• Hence, T is not a minimum spanning tree
Optimality Proofs
• Prim’s Algorithm computes a MST
• Kruskal’s Algorithm computes a MST
• Idea of both proofs: Show that when an edge is
added to the MST by Prim or Kruskal, the edge
is the minimum cost edge between S and V-S
for some set S of nodes (which grows with each
added edge until it equals V)
Prim’s Algorithm (grow a tree, T)
S = { s };
T = { };
while S != V
choose the minimum cost edge
e = (u,v), with u in S, and v in V-S
add e to T
add v to S
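The pseudocode above can be realized in Python with a priority queue holding the frontier edges (a sketch under the assumption that the graph comes as an adjacency-list dict; names are illustrative):

```python
import heapq

def prim(adj, s):
    """MST by Prim's algorithm.  adj maps each vertex to a list of
    (cost, neighbor) pairs; s is the start vertex.
    Returns the list of tree edges (cost, u, v)."""
    S = {s}                       # vertices already in the tree
    T = []                        # tree edges chosen so far
    frontier = [(c, s, v) for c, v in adj[s]]
    heapq.heapify(frontier)
    while frontier:
        c, u, v = heapq.heappop(frontier)
        if v in S:
            continue              # both endpoints already in S: skip
        # (c, u, v) is the cheapest edge with exactly one end in S
        S.add(v)
        T.append((c, u, v))
        for c2, w in adj[v]:
            if w not in S:
                heapq.heappush(frontier, (c2, v, w))
    return T
```

The heap replaces the "choose the minimum cost edge" step; stale entries whose far endpoint has since joined S are discarded when popped.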
Prove Prim’s algorithm computes an MST
(1) The algorithm only adds edges belonging to
every MST.
– On each iteration there is a set S ⊆ V on which a partial
spanning tree has been constructed, and a node v and edge e
are added so as to minimize ce over all edges e = (u,v)
with u in S and v in V-S. By definition e is
the cheapest edge with one end in S and the other in
V-S, so by the Cut Property it is in every minimum
spanning tree of G.
(2) The algorithm produces a spanning tree
- Clear
Kruskal’s Algorithm (grow bigger connected
sets, with the minimum cost edge available)
Let C = { C1 ={v1}, C2 = {v2}, . . ., Cn = {vn} }; T = { }
while |C| > 1
Let e = (u, v) with u in Ci and v in Cj be the
minimum cost edge joining (the disjoint and
disconnected) sets in C
Replace Ci and Cj by their union C’i = Ci ∪ Cj
Add e to T
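Sorting the edges once and tracking the components with a union-find structure gives the standard implementation (a Python sketch with assumed input format: vertices 0..n-1 and an edge list):

```python
def kruskal(n, edges):
    """MST by Kruskal's algorithm with union-find.
    edges: [(cost, u, v), ...]; vertices are 0..n-1."""
    parent = list(range(n))

    def find(x):                        # root of x's component
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    T = []
    for c, u, v in sorted(edges):       # ascending order of cost
        ru, rv = find(u), find(v)
        if ru != rv:                    # e joins two disjoint components
            parent[ru] = rv             # union: merge the components
            T.append((c, u, v))
    return T
```

Checking `find(u) != find(v)` is exactly the "unless doing so would create a cycle" test from the slide.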
Prove Kruskal’s algorithm computes a MST
(1) An edge e is in the MST when it is added to T.
– The sets in C are disjoint throughout: we start with
disjoint singletons, and whenever we add an edge joining
two sets we replace them by their union, so they remain
disjoint from each other. The claim then follows by the
“Cut Property”
(2) The process continues until there is only one
connected set containing all the vertices – so T
spans G
When can we guarantee an edge is not in
the MST?
• Cycle Property
– The most expensive edge on a cycle is never in a
MST
[Figure: cycle crossing the cut (S, V-S) via edges e and e1; e is the most expensive edge on the cycle]
– Optimality of Reverse-Delete algorithm follows from
this
Proof of the Cycle Property (also uses an
exchange argument!)
Proof: Suppose C is a cycle and e = (v,w) is its most
expensive edge. We proceed by contradiction:
• Assume e is in a MST T of G.
• If we delete e, we partition the nodes of T into two sets,
S and V – S, with v in S and w in V-S.
• Since we began with a cycle, there must be another
edge e’ of C with one end in S and one end in V-S. e was
the most expensive edge on C, so e’ is cheaper. We
exchange e’ for e, obtaining T’ = T – {e} + {e’}.
• T’ spans G and its cost is less than that of T.
• This contradicts the fact that T was a MST of G
Dealing with the assumption of no equal
weight edges
• Force the edge weights to be distinct
– Add small quantities to the weights
– Give a tie breaking rule for equal weight edges
Clustering
• Clustering: Given a set U of n objects labeled p1,
…, pn, classify them into coherent groups
e.g., photos, documents, micro-organisms
• Distance function: Numeric value specifying
"closeness" of two objects
e.g., number of corresponding pixels whose
intensities differ by some threshold
• Fundamental problem: Divide into clusters so
that points in different clusters are far apart
– Identify patterns in gene expression
– Document categorization for web search
– Similarity searching in medical image databases
Clustering of Maximum Spacing
• Distance function: Assume it satisfies several natural
properties
– d(pi, pj) = 0 iff pi = pj (identity of indiscernibles)
– d(pi, pj) ≥ 0 (nonnegativity)
– d(pi, pj) = d(pj, pi) (symmetry)
• Spacing: Min distance between any pair of points in
different clusters
• Clustering of maximum spacing: Given integer k, find a
k-clustering of maximum spacing
[Figure: a point set with the spacing indicated for k = 4, divided into 2, 3, and 4 clusters]
Greedy Clustering Algorithm
• Distance clustering algorithm
– Form a graph on the vertex set U as follows: (where the connected
components are the clusters -- without any edges you would have n
clusters)
– First draw an edge between the closest pair of points, then an
edge between the next closest pair, and keep adding edges
between pairs in increasing order of d(pi,pj). The connected
components correspond to clusters; there is no need to add an edge
between two points already in the same cluster (thus avoiding cycles)
– Repeat until there are exactly k clusters
• Key observation: This procedure is precisely Kruskal's algorithm
(except we stop when there are k connected components)
• Remark: Equivalent to finding a MST and deleting the k-1 most
expensive edges (if we take away k-1 edges from a spanning tree
we will then leave k connected components)
Distance Clustering Algorithm – like
Kruskal’s Algorithm
Let C = {{v1}, {v2},. . ., {vn}}; T = { }
while |C| > k
Let e = (u, v) with u in Ci and v in Cj be the
minimum cost edge joining disjoint sets in C
Replace Ci and Cj by C’i = Ci ∪ Cj
The resulting C is the k-clustering
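The procedure above is just the Kruskal sketch stopped early, which can be written directly (a hedged sketch; the all-pairs distance edge list and the names are assumptions):

```python
def max_spacing_clusters(n, k, edges):
    """Partition points 0..n-1 into k clusters of maximum spacing.
    edges: [(d, i, j), ...] giving the distance between each pair.
    Kruskal-style: keep merging the two closest clusters until
    exactly k clusters remain."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    clusters = n                  # start with n singleton clusters
    edges = sorted(edges)         # ascending order of distance
    idx = 0
    while clusters > k:
        d, i, j = edges[idx]
        idx += 1
        ri, rj = find(i), find(j)
        if ri != rj:              # endpoints in different clusters: merge
            parent[ri] = rj
            clusters -= 1
    # group the points by the root of their cluster
    groups = {}
    for p in range(n):
        groups.setdefault(find(p), []).append(p)
    return list(groups.values())
```

For four points on a line at positions 0, 1, 10, 11 and k = 2, the merges stop after joining {0,1} and {2,3}, leaving the spacing of 9 between clusters maximal.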
More Greedy Algorithms:
Coin Changing vs Stamp Buying
• Goal: Given currency denominations: 1, 5, 10, 25, 100, devise a
method to pay amount to customer using fewest number of coins
– Ex: 34¢
• Cashier's algorithm: At each iteration, add coin of the largest value
that does not take us past the amount to be paid
– Ex: $2.89
• Theorem: Greedy is optimal for U.S. coinage: 1, 5, 10, 25, 100
• Question: Is the greedy algorithm optimal for US postal
denominations: 1, 10, 21, 34, 70, 100, 350, 1225, 1500?
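The cashier's algorithm is short enough to test the postal question empirically (a sketch; the 140¢ counterexample below is an illustration chosen here, not from the slides):

```python
def greedy_change(amount, denominations):
    """Cashier's algorithm: repeatedly take the largest denomination
    that does not overshoot the remaining amount; return the total
    number of coins/stamps used."""
    count = 0
    for d in sorted(denominations, reverse=True):
        count += amount // d
        amount %= d
    return count

# U.S. coins: 34 cents = 25 + 5 + 4*1, i.e. 6 coins (optimal)
print(greedy_change(34, [1, 5, 10, 25, 100]))

# Postage of 140 with denominations 1, 10, 21, 34, 70, 100, ...:
# greedy gives 100 + 34 + 6*1 = 8 stamps, but 70 + 70 = 2 is optimal,
# so greedy is NOT optimal for the postal denominations.
print(greedy_change(140, [1, 10, 21, 34, 70, 100, 350, 1225, 1500]))
```

This shows the greedy exchange property that holds for U.S. coinage fails for the stamp denominations: taking the largest stamp first can forfeit a cheaper combination.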