A Data Compression Algorithm: Huffman Compression
Gordon College

[Figure: a block of sample text passing through "Compress" and "Uncompress"]

Compression
• Definition: the process of encoding data using fewer bits
• Reason: to save valuable resources such as communication bandwidth or hard disk space

Compression Types
1. Lossy
– Loses some information during compression, which means the exact original cannot be recovered (e.g. JPEG)
– Normally provides better compression
– Used when loss is acceptable: image, sound, and video files
2. Lossless
– The exact original can be recovered
– Usually exploits statistical redundancy
– Used when loss is not acceptable: data

Basic Term: Compression Ratio
• The ratio of the number of bits in the original data to the number of bits in the compressed data
• For example: 3:1 means the original file was 3000 bytes and the compressed file is now only 1000 bytes.
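The compression ratio defined above is simple arithmetic, and a short Python sketch may make it concrete. The function name `compression_ratio` is an assumption chosen for illustration; it is not part of the original slides. The sketch reduces the ratio to lowest terms with the standard `fractions` module.

```python
from fractions import Fraction


def compression_ratio(original_bits: int, compressed_bits: int) -> Fraction:
    """Return original size divided by compressed size as an exact fraction.

    Hypothetical helper for illustration only.
    """
    if compressed_bits <= 0:
        raise ValueError("compressed size must be positive")
    return Fraction(original_bits, compressed_bits)


# The example from the slide: a 3000-byte file compressed to 1000 bytes.
ratio = compression_ratio(3000 * 8, 1000 * 8)
print(f"{ratio.numerator}:{ratio.denominator}")  # prints 3:1
```

Because the ratio is a quotient, the same result is obtained whether the sizes are given in bits or in bytes, as long as both use the same unit.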
Variable-Length Codes
• Recall that ASCII, EBCDIC, and fixed-width Unicode encodings (e.g. UTF-32) use the same number of bits for every character
• Contrast Morse code
– Uses variable-length sequences
• Huffman compression is a variable-length encoding scheme

Variable-Length Codes
• Each character in such a code
– Has a weight (probability) and a length
• The expected length is the sum of the products of the weights and lengths for all the characters
• Goal: minimize the expected length
• Example: 0.2 x 2 + 0.1 x 4 + 0.1 x 4 + 0.15 x 3 + 0.45 x 1 = 2.1

Huffman Compression
• Uses prefix codes (sequence of optimal binary codes)
• Uses a greedy algorithm: at each step it makes the decision that looks best based on the data at hand.

Huffman Compression
Basic algorithm
1. Generate a table that contains the frequency of each character in a text.
2. Using the frequency table, assign each character a "bit code" (a sequence of bits to represent the character).
3. Write the bit code to the file instead of the character.

Immediate Decodability
• Definition: no sequence of bits that represents a character is a prefix of a longer sequence for another character
• Purpose: a message can be decoded without waiting for the remaining bits
• The coding scheme to the right is not immediately decodable [table]
• However, this one is [table]

Huffman Compression
• Huffman (1951)
• Uses frequencies of symbols in a string to build a variable rate prefix code.
– Each symbol is mapped to a binary string.
– More frequent symbols have shorter codes.
– No code is a prefix of another.

Huffman Codes
• We seek codes that are
– Immediately decodable
– Of minimal expected code length
• For a set of n characters $\{C_1, \ldots, C_n\}$ with weights $\{w_1, \ldots, w_n\}$
– We need an algorithm which generates n bit strings representing the codes

Cost of a Huffman Tree
• Let $p_1, p_2, \ldots, p_m$ be the probabilities for the symbols $a_1, a_2, \ldots, a_m$, respectively.
• Define the cost of the Huffman tree T to be
$HC(T) = \sum_{i=1}^{m} p_i r_i$
where $r_i$ is the length of the path from the root to $a_i$.
• HC(T) is the expected length of the code of a symbol coded by the tree T. HC(T) is the bit rate of the code.

Example of Cost
• Example: a 1/2, b 1/8, c 1/8, d 1/4
HC(T) = 1 x 1/2 + 3 x 1/8 + 3 x 1/8 + 2 x 1/4 = 1.75
[Figure: code tree with leaves a, b, c, d]

Huffman Tree
• Input: Probabilities $p_1, p_2, \ldots, p_m$ for symbols $a_1, a_2, \ldots, a_m$, respectively.
• Output: A tree that minimizes the average number of bits (bit rate) to code a symbol. That is, it minimizes
$HC(T) = \sum_{i=1}^{m} p_i r_i$
where $r_i$ is the length of the path from the root to $a_i$. This is a Huffman tree or Huffman code.

Recursive Algorithm - Huffman Codes
1. Initialize a list of n one-node binary trees, each containing the weight for one character
2. Repeat the following n – 1 times:
a. Find two trees T' and T" in the list with minimal weights w' and w"
b. Replace these two trees with a binary tree whose root is w' + w" and whose subtrees are T' and T", and label the pointers to these subtrees 0 and 1

Huffman's Algorithm
3. The code for character $C_i$ is the bit string labeling a path in the final binary tree from the root to $C_i$
Given characters [table], the end result is [figure] with codes [table]

Huffman Decoding Algorithm
1. Initialize pointer p to the root of the Huffman tree
2. While the end of the message string is not reached, repeat the following:
a. Let x be the next bit in the string
b. If x = 0, set p equal to its left child pointer; else set p to its right child pointer
c. If p points to a leaf
i. Display the character with that leaf
ii. Reset p to the root of the Huffman tree

Huffman Decoding Algorithm
• Decode the message string 0101011010
– Using the Huffman tree and the decoding algorithm

Iterative Huffman Tree Algorithm
1. Form a node for each symbol $a_i$ with weight $p_i$;
2. Insert the nodes in a min priority queue ordered by probability;
3.
While the priority queue has more than one element do
– min1 := delete-min;
– min2 := delete-min;
– create a new node n;
– n.weight := min1.weight + min2.weight;
– n.left := min1; also associate this link with bit 0
– n.right := min2; also associate this link with bit 1
– insert(n)
4. Return the last node in the priority queue.

Example of Huffman Tree Algorithm (1)
• P(a) = .4, P(b) = .1, P(c) = .3, P(d) = .1, P(e) = .1

Example of Huffman Tree Algorithm (2)
[Figure]

Example of Huffman Tree Algorithm (3)
[Figure]

Example of Huffman Tree Algorithm (4)
[Figure]

Huffman Code
[Figure]

In class example
• I will praise you and I will love you Lord

Index  Sym    Freq  Parent  Left  Right  Nbits  Bits
0      space  9     30      -1    -1     2      01
1      I      2     23      -1    -1     5      11010
2      L      1     17      -1    -1     5      00010
3      a      2     20      -1    -1     5      11110
4      d      2     22      -1    -1     5      11101
5      e      2     21      -1    -1     4      0000
6      i      3     25      -1    -1     4      1100
7      l      5     28      -1    -1     3      101
8      n      1     17      -1    -1     5      00011
9      o      4     26      -1    -1     3      001
10     p      1     18      -1    -1     6      100110
11     r      2     23      -1    -1     5      11011
12     s      1     18      -1    -1     6      100111
13     u      2     24      -1    -1     4      1000
14     v      1     19      -1    -1     5      10010
15     w      2     20      -1    -1     5      11111
16     y      2     22      -1    -1     5      11100
26