Chapter 4: Variable-Length and Huffman Codes

Unique Decodability

We must always be able to determine where one code word ends and the next one begins.

Counterexample: suppose s1 = 0, s2 = 1, s3 = 11, s4 = 00. Then 0011 decodes as either s4 s3 or s1 s1 s3.

Let S = {s1, ..., sq} be the set of source symbols, and let

    S^n = { s_{i_1} ... s_{i_n} : s_{i_j} ∈ S }

be the n-th direct product of S, i.e., the sequences of n symbols from S. Unique decodability means that any two distinct sequences of source symbols (of possibly differing lengths) are encoded as distinct strings.

4.1-4.2 Instantaneous Codes

Example: s1 = 0, s2 = 10, s3 = 110, s4 = 111.

[Decoding tree: each 0-edge ends at a leaf (s1, s2, s3 in turn), each 1-edge descends one level, and the final 1-edge ends at the leaf s4.]

No code word is the prefix of another. Reading a sequence of code words left to right, one can therefore recognize the end of each code word the instant its last digit arrives.

Consider the reverse code: s1 = 0, s2 = 01, s3 = 011, s4 = 111. The string 0111...111 is uniquely decodable, but the first symbol cannot be decoded until we have read all the way to the end.

4.3 Constructing Instantaneous Codes

    comma code:    s1 = 0,  s2 = 10, s3 = 110, s4 = 1110, s5 = 1111
    modification:  s1 = 00, s2 = 01, s3 = 10,  s4 = 110,  s5 = 111

[Decoding tree for the modified code, with leaves s1 = 00, s2 = 01, s3 = 10, s4 = 110, s5 = 111.] Notice that every code word is located at a leaf.

4.4 Kraft Inequality

Theorem: There exists an instantaneous code for S, where each symbol s ∈ S is encoded in radix r with length |s|, if and only if

    Σ_{s ∈ S} r^{-|s|} ≤ 1.

Proof (⇒): By induction on the height of the decoding tree, n = max{|s| : s ∈ S}. For simplicity, take r = 2 (the binary case).

Basis (n = 1): the tree is a root with leaves 0 and/or 1, so the sum is 1/2 or 1/2 + 1/2 = 1, and in either case ≤ 1.

Induction (n > 1): the root splits the tree into subtrees T0 and T1 of height < n. (We could even allow n = 0 here: a single-leaf tree, whose one code word is the empty word, has Kraft sum 1.) By the induction hypothesis, the leaves of T0 and T1 each satisfy the Kraft inequality:

    Σ_{s ∈ T0} 2^{-|s|} ≤ 1   and   Σ_{s ∈ T1} 2^{-|s|} ≤ 1.

Prefixing one digit at the top of the tree increases every length by one, so

    Σ_{s ∈ T0} 2^{-(|s|+1)} + Σ_{s ∈ T1} 2^{-(|s|+1)} = (1/2) Σ_{s ∈ T0} 2^{-|s|} + (1/2) Σ_{s ∈ T1} 2^{-|s|} ≤ 1/2 + 1/2 = 1.

The same argument works in radix r. Basis (n = 1): at most r leaves at depth 1 give a sum of at most r · (1/r) = 1. Induction (n > 1): the root has at most r subtrees T0, ..., T_{r-1}, each of height < n; by the induction hypothesis Σ_{s ∈ Ti} r^{-|s|} ≤ 1, so after prefixing one digit Σ_{s ∈ Ti} r^{-(|s|+1)} ≤ 1/r, and adding at most r of these together gives ≤ 1.

Strict inequality in the binary case implies that not every internal node has degree 2; but if a node has degree 1, that edge can clearly be removed by contraction, shortening a code word.

Proof (⇐): Construct a code via its decoding tree. Number the symbols s1, ..., sq so that l1 ≤ ... ≤ lq and assume K = Σ_i r^{-l_i} ≤ 1. Greedy method: proceed left to right, systematically assigning to each length l_i the leftmost available leaf at depth l_i, so that we never pass through or land on a previously assigned leaf. The only way this method could fail is by running out of nodes (the tree is over-full), but that would mean K > 1.

Examples (r = 2):

    lengths 1, 3, 3, 3:  K = 1/2 + 1/8 + 1/8 + 1/8 = 7/8 < 1  (code exists; one leaf is left unused)
    lengths 1, 2, 3, 3:  K = 1/2 + 1/4 + 1/8 + 1/8 = 1        (code exists; the tree is exactly full)
    lengths 1, 2, 2, 3:  K = 1/2 + 1/4 + 1/4 + 1/8 = 9/8 > 1  (no instantaneous code)
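Not part of the original notes: a minimal Python sketch of this greedy construction (the names kraft_sum, to_base, and greedy_code are mine). It checks the Kraft sum exactly with fractions and assigns leaves left to right, exactly as in the examples above:

```python
from fractions import Fraction

def kraft_sum(lengths, r=2):
    """K = sum over the code-word lengths l of r**(-l), computed exactly."""
    return sum(Fraction(1, r ** l) for l in lengths)

def to_base(n, r, width):
    """Integer n written as a base-r string of exactly `width` digits."""
    digits = []
    for _ in range(width):
        n, d = divmod(n, r)
        digits.append(str(d))
    return "".join(reversed(digits))

def greedy_code(lengths, r=2):
    """Greedy left-to-right leaf assignment from the (<=) direction of
    the proof: it succeeds exactly when the Kraft sum is <= 1."""
    lengths = sorted(lengths)
    if kraft_sum(lengths, r) > 1:
        raise ValueError("Kraft sum > 1: no instantaneous code exists")
    words, node, prev = [], 0, 0
    for l in lengths:
        node *= r ** (l - prev)  # descend to depth l: next free leaf there
        words.append(to_base(node, r, l))
        node += 1                # never reuse or pass through this leaf
        prev = l
    return words

# The three examples above:
print(greedy_code([1, 3, 3, 3]))  # ['0', '100', '101', '110'], K = 7/8 < 1
print(greedy_code([1, 2, 3, 3]))  # ['0', '10', '110', '111'],  K = 1
# greedy_code([1, 2, 2, 3]) raises: K = 9/8 > 1
```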
4.5 Shortened Block Codes

With exactly 2^m symbols, we can form a block code whose code words b1 ... bm, bi ∈ {0, 1}, all have length m; the decoding tree is the complete binary tree of depth m. With fewer than 2^m symbols, we can chop off branches to get modified (shortened) block codes.

[Two example decoding trees, Ex 1 and Ex 2: the complete depth-3 tree pruned in two different ways to hold the five symbols s1, ..., s5.]

4.6 McMillan Inequality

Idea: uniquely decodable codes satisfy the same bound as instantaneous codes.

Theorem: Suppose we have a uniquely decodable code in radix r with lengths l1 ≤ ... ≤ lq. Then the Kraft sum is ≤ 1.

Proof: Let K = Σ_{i=1}^q r^{-l_i}, and consider

    K^n = Σ_{k=n}^{n·lq} N_k r^{-k}.

By the multinomial expansion, N_k is the number of ways n of the l's can add up to k, which is the number of different sequences of n source symbols whose coded message has length k. Because of unique decodability, distinct sequences have distinct encodings, so N_k ≤ r^k, the number of radix-r strings of length k. Hence

    K^n ≤ Σ_{k=n}^{n·lq} r^k · r^{-k} = n·lq − (n − 1) ≤ n·lq.

But if K > 1, then K^n grows exponentially in n, while n·lq grows only linearly, a contradiction. Hence K ≤ 1.

Conclusion: WLOG we can use only instantaneous codes.

4.7 Average Code Length

    L_avg = Σ_{i=1}^q p_i l_i,

where p_i is the probability of symbol s_i and l_i = |s_i| is the length of s_i. Our goal is to minimize the average code length.

In a code of minimal average length, if p_n > p_m then l_n ≤ l_m. For if p_m < p_n with l_m < l_n, then interchanging the encodings of s_m and s_n gives

    L_old − L_new = (p_m l_m + p_n l_n) − (p_m l_n + p_n l_m) = (p_m − p_n)(l_m − l_n) > 0,

so old > new. So we can assume that if p1 ≥ ... ≥ pq then l1 ≤ ... ≤ lq, because if p_i = p_{i+1} with l_i > l_{i+1} we can just switch s_i and s_{i+1}.

4.8 Huffman Algorithm

Start with the source alphabet S = {s1, ..., sq} and take B = {0, 1} as the code alphabet (binary).

First, observe that we may take l_{q−1} = l_q: since the code is instantaneous, no earlier code word can be a prefix of s_q, so dropping the last digit of s_q (if l_q > l_{q−1}) does no harm.

Huffman algorithm: combine s_{q−1} and s_q into a "combo-symbol" (s_{q−1}+s_q) with probability p_{q−1} + p_q, and get a code for the reduced alphabet. For q = 1, assign s1 = ε. For q > 1, recursively encode the reduced alphabet, then let s_{q−1} = (s_{q−1}+s_q)0 and s_q = (s_{q−1}+s_q)1.

N.B. the case q = 1 does not produce a valid code: ε is the empty word.

Example (L_avg = 2.2):

    step 0: 0.4  0.2  0.2  0.1  0.1    codes: 1  01  000  0010  0011
    step 1: 0.4  0.2  0.2  0.2         codes: 1  01  000  001
    step 2: 0.4  0.4  0.2              codes: 1  00  01
    step 3: 0.6  0.4                   codes: 0  1
    step 4: 1.0

Huffman is always of shortest average length

Claim: assuming p1 ≥ ... ≥ pq, the Huffman code's L_avg is ≤ the average length L of any alternative instantaneous code. Example: for p1 = 0.7, p2 = p3 = p4 = 0.1, Huffman gives lengths 1, 2, 3, 3, so L_avg = 1.5; compare this to log2 q = 2 for a block code.

Base case: for q = 2, Huffman gives s1 = 0, s2 = 1, and no shorter code exists.

Induction step: for q > 2, take any instantaneous code for s1, ..., sq with minimal average length L, and as above assume l1 ≤ ... ≤ lq. We may take l_{q−1} = l_q with s_{q−1} and s_q siblings, so the combined symbol s_{q−1}+s_q sits at depth l_q − 1 (that is, l_{q−1} = l_q = |s_{q−1}+s_q| + 1). Its reduced code then satisfies

    L′ = Σ_{i=1}^{q−2} p_i l_i + (p_{q−1}+p_q)(l_q − 1) = L − (p_{q−1}+p_q).

By the induction hypothesis, the reduced Huffman code is optimal for the reduced alphabet: L′_avg ≤ L′. But the Huffman code satisfies the same equation, L_avg = L′_avg + (p_{q−1}+p_q); hence

    L_avg = L′_avg + (p_{q−1}+p_q) ≤ L′ + (p_{q−1}+p_q) = L.

4.9 Code Extensions

Take p1 = ⅔ and p2 = ⅓. The Huffman code gives s1 = 0, s2 = 1, with L_avg = 1.

Square the symbol alphabet to get S²: s11 = s1s1, s12 = s1s2, s21 = s2s1, s22 = s2s2, with

    p11 = 4⁄9, p12 = 2⁄9, p21 = 2⁄9, p22 = 1⁄9.

Apply Huffman to S²: s11 = 1, s12 = 01, s21 = 000, s22 = 001, giving

    L_avg = (4⁄9)·1 + (2⁄9)·2 + (2⁄9)·3 + (1⁄9)·3 = 17⁄9 ≈ 1.89.

But we are sending two symbols at a time! Per source symbol this is 17⁄18 ≈ 0.94 < 1.

4.10 Huffman Codes in Radix r

At each stage down, we merge the last (least probable) r states into one, reducing the number of states by r − 1. Since we end with one state, we must begin with a number of states ≡ 1 mod (r − 1). We pad out the states with probability-0 symbols to get this.

Example (r = 4, k = 3 merging stages): eight symbols, padded with two probability-0 states so that 10 ≡ 1 (mod 3):

    stage 0: 0.22 0.2 0.18 0.15 0.1 0.08 0.05 0.02 0.0 0.0   codes: 1 2 3 00 01 02 030 031 (pads unused)
    stage 1: 0.22 0.2 0.18 0.15 0.1 0.08 0.07                codes: 1 2 3 00 01 02 03
    stage 2: 0.4 0.22 0.2 0.18                               codes: 0 1 2 3
    stage 3: 1.0
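Not part of the original notes: a runnable Python sketch of this merging procedure, covering both the binary case (r = 2) and general radix r with probability-0 padding. The function name huffman and the tie-breaking choices are mine, so code words can differ from the tables above even when the average length agrees.

```python
import heapq

def huffman(probs, r=2):
    """Radix-r Huffman code: returns one code word (a string over the
    digits 0..r-1) per probability, in the input order."""
    n = len(probs)
    if n == 1:
        return [""]  # degenerate q = 1 case: ε, not a valid code
    # Pad with probability-0 states so that #states = 1 mod (r - 1).
    pad = (1 - n) % (r - 1)
    # Heap entries: (probability, tie-break id, subtree); a leaf subtree
    # is the symbol's index, and a padding leaf is None.
    heap = [(p, i, i) for i, p in enumerate(probs)]
    heap += [(0, n + j, None) for j in range(pad)]
    heapq.heapify(heap)
    uid = n + pad
    while len(heap) > 1:
        # Merge the r least probable states into one combined state,
        # ordering its children from most to least probable.
        parts = [heapq.heappop(heap) for _ in range(r)]
        uid += 1
        heapq.heappush(heap, (sum(p for p, _, _ in parts), uid,
                              [t for _, _, t in reversed(parts)]))
    code = [None] * n
    def walk(tree, prefix):
        if tree is None:                  # unused padding leaf
            return
        if isinstance(tree, list):        # digit d -> d-th subtree
            for d, child in enumerate(tree):
                walk(child, prefix + str(d))
        else:
            code[tree] = prefix           # real leaf: record its word
    walk(heap[0][2], "")
    return code

# The radix-4 example above (two probability-0 pads are added):
p = [0.22, 0.2, 0.18, 0.15, 0.1, 0.08, 0.05, 0.02]
print(huffman(p, r=4))  # ['1', '2', '3', '00', '01', '02', '030', '031']
```

With r = 2 and the probabilities from the 4.8 example, this sketch breaks ties differently than the table there, but the average length comes out the same (2.2).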
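Also not from the notes: a quick check of the extension computation in 4.9, reusing the huffman sketch above (again, ties may yield different code words, but the average is 17/9 either way):

```python
from fractions import Fraction

# S^2 for p1 = 2/3, p2 = 1/3: the pairs s11, s12, s21, s22.
p_pairs = [Fraction(4, 9), Fraction(2, 9), Fraction(2, 9), Fraction(1, 9)]
code = huffman(p_pairs)               # binary Huffman on the squared alphabet
bits_per_pair = sum(p * len(w) for p, w in zip(p_pairs, code))
print(bits_per_pair)                  # 17/9 bits per pair
print(bits_per_pair / 2)              # 17/18 < 1 bit per source symbol
```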