Algorithm Design and Analysis
LECTURE 8: Greedy Algorithms V • Huffman Codes
Adam Smith, 9/10/10; based on slides by E. Demaine, C. Leiserson, S. Raskhodnikova, K. Wayne

Review Questions

Let G be a connected undirected graph with distinct edge weights. Answer true or false:
• Let e be the cheapest edge in G. Some MST of G contains e?
• Let e be the most expensive edge in G. No MST of G contains e?

Review

• Exercise: given an undirected graph G, consider the spanning trees produced by four algorithms:
  – BFS tree
  – DFS tree
  – shortest-paths tree (Dijkstra)
  – MST
• Find a graph where
  – all four trees are the same
  – all four trees must be different (note: the BFS/DFS trees may depend on exploration order)

Non-distinct edge weights?

• Read in the text.

Implementing MST algorithms

• Prim: similar to Dijkstra.
• Kruskal:
  – Requires an efficient data structure to keep track of the "islands": the Union-Find data structure.
  – We may revisit this later in the course.
• You should know how to implement Prim in O(m log_{m/n} n) time.

Implementation of Prim(G, w)

IDEA: Maintain V − S as a priority queue Q (as in Dijkstra). Key each vertex in Q with the weight of the least-weight edge connecting it to a vertex in S.

    Q ← V
    key[v] ← ∞ for all v ∈ V
    key[s] ← 0 for some arbitrary s ∈ V
    while Q ≠ ∅
        do u ← EXTRACT-MIN(Q)
           for each v ∈ Adjacency-list[u]
               do if v ∈ Q and w(u, v) < key[v]
                     then key[v] ← w(u, v)      ⊳ DECREASE-KEY
                          p[v] ← u

At the end, {(v, p[v])} forms the MST.

Analysis of Prim

    Q ← V                                  ⊳ initialization:
    key[v] ← ∞ for all v ∈ V               ⊳ Θ(n) total
    key[s] ← 0 for some arbitrary s ∈ V
    while Q ≠ ∅                            ⊳ n iterations
        do u ← EXTRACT-MIN(Q)
           for each v ∈ Adj[u]             ⊳ degree(u) iterations
               do if v ∈ Q and w(u, v) < key[v]
                     then key[v] ← w(u, v) ⊳ implicit DECREASE-KEY
                          p[v] ← u

Handshaking Lemma ⇒ Θ(m) implicit DECREASE-KEYs. Time: as in Dijkstra.

    PQ operation   # of ops   Array   Binary heap   d-way heap   Fib heap†
    EXTRACT-MIN    n          n       log n         HW3          log n
    DECREASE-KEY   m          1       log n         HW3          1
    Total                     n²      m log n       m log_{m/n} n   m + n log n

    † Individual ops are amortized bounds.

Greedy Algorithms for MST

• Kruskal's: Start with T = ∅. Consider edges in ascending order of weight. Insert edge e into T unless doing so would create a cycle. (A runnable sketch with a simple Union-Find appears below.)
• Reverse-Delete: Start with T = E. Consider edges in descending order of weight. Delete edge e from T unless doing so would disconnect T.
• Prim's: Start with some root node s. Grow a tree T from s outward. At each step, add to T the cheapest edge e with exactly one endpoint in T. (A runnable sketch also appears below.)
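To make the binary-heap row of the table above concrete, here is a minimal runnable sketch of Prim in Python. It is not from the slides: the function name prim_mst and the adjacency-list format are my own choices, and since Python's heapq has no DECREASE-KEY, the sketch re-inserts a vertex with its smaller key and skips stale entries on extraction, which preserves the O(m log n) bound because at most O(m) entries ever enter the heap.

    import heapq

    def prim_mst(adj, s=0):
        """Prim's algorithm with a binary heap and lazy DECREASE-KEY.
        adj[u] = list of (v, weight) pairs; graph assumed connected.
        Returns the MST as a list of (p[v], v, weight) edges."""
        n = len(adj)
        in_tree = [False] * n
        key = [float('inf')] * n
        p = [None] * n
        key[s] = 0
        pq = [(0, s)]                      # entries are (key[v], v)
        mst = []
        while pq:
            k, u = heapq.heappop(pq)
            if in_tree[u]:                 # stale entry left by a re-insertion
                continue
            in_tree[u] = True
            if p[u] is not None:
                mst.append((p[u], u, k))
            for v, w in adj[u]:
                if not in_tree[v] and w < key[v]:
                    key[v] = w             # implicit DECREASE-KEY: re-insert
                    p[v] = u
                    heapq.heappush(pq, (w, v))
        return mst

    # A 4-cycle with weights 1, 2, 3, 4; the MST drops the weight-4 edge.
    adj = [[(1, 1), (3, 4)], [(0, 1), (2, 2)], [(1, 2), (3, 3)], [(2, 3), (0, 4)]]
    print(prim_mst(adj))                   # [(0, 1, 1), (1, 2, 2), (2, 3, 3)]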
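And a matching sketch of Kruskal's, using the Union-Find structure summarized in the next table, with union by size and path compression in the spirit of KT Chapter 4.6. Again, the function names and edge representation are my own, not code from the course.

    def kruskal_mst(n, edges):
        """Kruskal's algorithm: sort edges by weight; add each edge unless
        it closes a cycle. edges: list of (weight, u, v) triples.
        Union-Find tracks the 'islands' (connected components)."""
        parent = list(range(n))
        size = [1] * n

        def find(x):                      # find root, with path compression
            root = x
            while parent[root] != root:
                root = parent[root]
            while parent[x] != root:      # second pass: repoint to the root
                parent[x], x = root, parent[x]
            return root

        def union(a, b):                  # union by size
            ra, rb = find(a), find(b)
            if ra == rb:
                return False              # same island: edge would close a cycle
            if size[ra] < size[rb]:
                ra, rb = rb, ra
            parent[rb] = ra
            size[ra] += size[rb]
            return True

        mst = []
        for w, u, v in sorted(edges):     # ascending order of weight
            if union(u, v):
                mst.append((u, v, w))
        return mst

    edges = [(1, 0, 1), (2, 1, 2), (3, 2, 3), (4, 3, 0)]
    print(kruskal_mst(4, edges))          # [(0, 1, 1), (1, 2, 2), (2, 3, 3)]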
Union-Find Data Structures

    Operation \ Implementation            Arrays + linked lists with sizes              Balanced trees
    FIND (worst case)                     Θ(1)                                          Θ(log n)
    UNION of sets A, B (worst case)       Θ(min(|A|, |B|)), could be as large as Θ(n)   Θ(log n)
    Amortized analysis: k UNIONs and
      k FINDs, starting from singletons   Θ(k log k)                                    Θ(k log k)

• With modifications (path compression), the amortized time for the tree structure over n operations is O(n α(n)), where α(n), the inverse Ackermann function, grows much more slowly than log n.
• See KT Chapter 4.6.

Huffman codes

Prefix-free codes

• A binary code maps characters in an alphabet (say {A, …, Z}) to binary strings.
• Prefix-free code: no codeword is a prefix of any other.
  – ASCII: prefix-free (all symbols have the same length).
  – Not prefix-free:
    • a → 0
    • b → 1
    • c → 00
    • d → 01
    • …
• Why is prefix-free good? (The encoded stream can be decoded unambiguously left to right, with no separators between codewords.)

A prefix-free code for a few letters

• e.g. e → 00, p → 10011
• [Figure: code tree. Source: Wikipedia]

A prefix-free code

• e.g. T → 1001, U → 1000011
• [Figure: code tree for the alphabet. Source: Jeff Erickson's notes]

How good is a prefix-free code?

• Given a text, let f[i] = # occurrences of letter i.
• Total number of bits needed: Σ_i f[i] · depth(i), where depth(i) is the length of letter i's codeword (its depth in the code tree).
• How do we pick the best prefix-free code?

Huffman's Algorithm (1952)

• Given individual letter frequencies f[1, …, n]:
  – Find the two least frequent letters i, j.
  – Merge them into a symbol with frequency f[i] + f[j].
  – Repeat.
• e.g.
  – a: 6
  – b: 6
  – c: 4
  – d: 3
  – e: 2
• Theorem: Huffman's algorithm finds an optimal prefix-free code. (A runnable sketch of the merging procedure appears at the end of this part, after the data-compression slide.)

Warming up

• Lemma 0: Every optimal prefix-free code corresponds to a full binary tree.
  – (Full = every node has 0 or 2 children.)
• Lemma 1: Let x and y be two least frequent characters. There is an optimal code in which x and y are siblings.

Huffman codes are optimal

Proof by induction!
• Base case: two symbols; there is only one full tree.
• Induction step:
  – Suppose f[1], f[2] are the smallest frequencies in f[1, …, n].
  – Let T be an optimal code for {1, …, n}.
  – Lemma 1 ⇒ we can choose T so that 1 and 2 are siblings.
  – Add a new symbol numbered n+1, with f[n+1] = f[1] + f[2].
  – Let T′ be the code obtained by merging 1, 2 into n+1.
• Cost of T in terms of T′: cost(T) = cost(T′) + f[1] + f[2], since expanding leaf n+1 into an internal node with children 1 and 2 pushes those f[1] + f[2] occurrences one level deeper.
• Let H be the Huffman code for {1, …, n}.
• Let H′ be the Huffman code for {3, …, n+1}.
• Induction hypothesis: cost(H′) ≤ cost(T′).
• cost(H) = cost(H′) + f[1] + f[2] ≤ cost(T′) + f[1] + f[2] = cost(T). QED

Notes

• See Jeff Erickson's lecture notes on greedy algorithms:
  – http://theory.cs.uiuc.edu/~jeffe/teaching/algorithms/

Data Compression for real?

• Generally, we don't use letter-by-letter encoding.
• Instead, find frequently repeated substrings.
  – The Lempel-Ziv algorithm is extremely common.
  – It also has deep connections to entropy.
• If we have time for string algorithms, we'll cover this…
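Before turning to entropy, here is a minimal runnable sketch of Huffman's merging procedure on the example frequencies above. The tuple-based tree representation and the tie-breaking counter are my implementation choices, not part of the algorithm's specification.

    import heapq
    from itertools import count

    def huffman_code(freq):
        """Build an optimal prefix-free code by repeatedly merging the two
        least frequent symbols. freq: dict symbol -> frequency.
        Returns dict symbol -> codeword (string of '0'/'1')."""
        tiebreak = count()                  # keeps heap entries comparable on ties
        heap = [(f, next(tiebreak), sym) for sym, f in freq.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)    # two least frequent subtrees
            f2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
        code = {}

        def walk(node, prefix):
            if isinstance(node, tuple):     # internal node of the full binary tree
                walk(node[0], prefix + '0')
                walk(node[1], prefix + '1')
            else:                           # leaf: a symbol
                code[node] = prefix or '0'  # single-symbol edge case
        _, _, root = heap[0]
        walk(root, '')
        return code

    # The slides' example: a:6, b:6, c:4, d:3, e:2
    print(huffman_code({'a': 6, 'b': 6, 'c': 4, 'd': 3, 'e': 2}))
    # e.g. {'c': '00', 'e': '010', 'd': '011', 'a': '10', 'b': '11'}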
Huffman codes and entropy

• Given a set of frequencies, consider the probabilities p[i] = f[i] / (f[1] + … + f[n]).
• Entropy(p) = Σ_i p[i] log(1/p[i])
• The Huffman code has expected depth
  Entropy(p) ≤ Σ_i p[i] · depth(i) ≤ Entropy(p) + 1.
• To prove the upper bound, find some prefix-free code where
  – depth(i) ≤ log(1/p[i]) + 1 for every symbol i.
  – Exercise!
• The bound applies to Huffman too, by optimality. (A numeric check on the example frequencies appears below.)

Prefix-free encoding of arbitrary-length strings

• What if I want to send you a text,
  – but you don't know ahead of time how long it is?
• Option 1: put the length at the beginning: n + log(n) bits.
  – Requires the sender to know the length in advance.
• Option 2: after every B bits, put a special bit indicating whether or not we're done: n(1 + 1/B) + B − 1 bits.
• Can we do better?
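A short sketch of Option 2, to make the bit count concrete. The exact block layout below (one flag bit after each block, zero-padding of the last block) is my own illustration; the slide does not fix a format.

    def frame_encode(bits, B):
        """After every block of B payload bits, emit one flag bit:
        1 = more blocks follow, 0 = this was the last block.
        The last block is zero-padded, costing at most B - 1 extra bits."""
        blocks = [bits[i:i + B] for i in range(0, len(bits), B)] or ['']
        out = []
        for i, blk in enumerate(blocks):
            out.append(blk.ljust(B, '0'))            # pad the final block
            out.append('1' if i < len(blocks) - 1 else '0')
        return ''.join(out)

    n, B = 100, 10
    print(len(frame_encode('1' * n, B)))             # 110 = n * (1 + 1/B) here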
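Finally, a numeric check of the entropy bound on the example frequencies (a: 6, b: 6, c: 4, d: 3, e: 2). This sketch relies on a standard fact not stated on the slides: Σ_i f[i] · depth(i) equals the sum of the merged frequencies over all of Huffman's merges, because each merge pushes every symbol in the two merged subtrees one level deeper.

    import heapq
    from math import log2

    def huffman_cost(freqs):
        """Total bits Σ f[i] * depth(i), computed as the sum of merge weights."""
        heap = list(freqs)
        heapq.heapify(heap)
        total = 0
        while len(heap) > 1:
            a, b = heapq.heappop(heap), heapq.heappop(heap)
            total += a + b                     # this merge's weight
            heapq.heappush(heap, a + b)
        return total

    def entropy(freqs):
        n = sum(freqs)
        return sum(f / n * log2(n / f) for f in freqs)

    freqs = [6, 6, 4, 3, 2]                    # a, b, c, d, e from the example
    avg = huffman_cost(freqs) / sum(freqs)     # expected codeword length
    print(entropy(freqs), avg)                 # ≈ 2.21 ≤ ≈ 2.24 ≤ 2.21 + 1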