Data Compression and Huffman Trees (HW 4)
Data Structures, Fall 2008
Modified by Eugene Weinstein

Representing Text (ASCII)
• A way of representing characters as bits
  – Characters are ‘a’, ‘b’, ‘1’, ‘%’, ‘@’, ‘\n’, ‘\t’, …
• Each character is represented by a unique 7-bit code, so there are 128 possible characters.
  – STATIC LENGTH ENCODING
• To encode a long text, we encode it character by character.

Inefficiency of ASCII
• Realization: in many natural files we are much more likely to see the letter ‘e’ than the character ‘&’, yet both are encoded using 7 bits!
• Solution: use variable length encoding! The encoding for ‘e’ should be shorter than the encoding for ‘&’.

Variable Length Coding
• Assume we know the distribution of characters (‘e’ appears 1000 times, ‘&’ appears 1 time).
• The more frequent a character, the fewer bits its code should use (made precise later).
• Need a ‘prefix free’ encoding: if ‘e’ = 001, then we cannot assign ‘&’ to be 0011. Since the encoding is variable length, the decoder needs to know when a code word ends.

Encoding Trees
• Think of the encoding as an (unbalanced) binary tree.
• Data is in leaf nodes only (prefix free).
• [Figure: encoding tree. From the root, edge 0 leads to leaf ‘e’ and edge 1 leads to an internal node, whose edge 0 leads to leaf ‘a’ and edge 1 to leaf ‘b’.]
• ‘e’ = 0, ‘a’ = 10, ‘b’ = 11
• How to decode ‘01110’?

Cost of a Tree
• For each character c_i, let f_i be its frequency in the file.
• Given an encoding tree T, let d_i be the depth of c_i in the tree (the number of bits needed to encode the character).
• The length of the file after encoding it with the coding scheme defined by T will be C(T) = Σ_i d_i f_i.

Creating an Optimal T
• Problem: find a tree T with C(T) minimal.
• Solution (Huffman 1952):
  – Create a single-node tree for each character. The weight W(T) of the tree is the frequency of the character.
  – Repeat n-1 times (n = number of chars):
    • Select the two trees T’, T’’ with the lowest weights. Merge them together to form T.
    • Set W(T) = W(T’) + W(T’’)
• Implement using a min-heap.
• What is the running time?
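
As a rough illustration of the slides above, the Java sketch below decodes a bit string with the example code (‘e’ = 0, ‘a’ = 10, ‘b’ = 11) and runs the merge loop on bare weights to get the cost C(T) of the resulting tree. The class name PrefixCodeDemo, the method names decode and optimalCost, and the frequencies 5, 2, 1 are assumptions made for illustration. java.util.PriorityQueue is used here as a stand-in min-heap; it is not the BinaryHeap given in class.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

class PrefixCodeDemo {
    // Decode a bit string using a prefix-free table (code word -> character).
    // Because no code word is a prefix of another, the first match is the only match.
    static String decode(String bits, Map<String, Character> table) {
        StringBuilder out = new StringBuilder();
        StringBuilder current = new StringBuilder();
        for (char bit : bits.toCharArray()) {
            current.append(bit);
            Character c = table.get(current.toString());
            if (c != null) {            // reached a leaf: emit it, restart at the root
                out.append(c);
                current.setLength(0);
            }
        }
        if (current.length() != 0) {
            throw new IllegalArgumentException("trailing bits do not form a code word");
        }
        return out.toString();
    }

    // Merge the two lightest weights until one tree remains.
    // Every merge pushes all leaves below it one level deeper, so the sum of
    // the merged weights equals the sum of d_i * f_i for the final tree.
    static long optimalCost(long[] frequencies) {
        PriorityQueue<Long> heap = new PriorityQueue<>();   // stand-in min-heap
        for (long f : frequencies) {
            heap.add(f);
        }
        long cost = 0;
        while (heap.size() > 1) {       // runs n-1 times
            long merged = heap.poll() + heap.poll();
            cost += merged;
            heap.add(merged);
        }
        return cost;
    }

    public static void main(String[] args) {
        Map<String, Character> table = new HashMap<>();
        table.put("0", 'e');
        table.put("10", 'a');
        table.put("11", 'b');
        System.out.println(decode("01110", table));             // prints eba

        // Hypothetical frequencies: e = 5, a = 2, b = 1.
        System.out.println(optimalCost(new long[] {5, 2, 1}));  // prints 11
    }
}
```

With these assumed frequencies the example tree has cost 1·5 + 2·2 + 2·1 = 11, which matches the sum of the merged weights (3 + 8). The loop performs n-1 merges, each with a constant number of O(log n) heap operations, which suggests O(n log n) overall.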

Optimality Intuition
• Need to show that Huffman’s algorithm indeed results in a tree T with optimal C(T) = Σ_i d_i f_i.
• The two letters with the lowest weights should be siblings at the bottom of the tree (otherwise we could improve the cost by swapping).
• Intuitively, when we combine two trees we can think of the result as a new letter with the combined weight.

Homework
• Implement:
  – public class HuffmanTree
    • Has a traversal/code printing method
  – public class HuffmanNode (Comparable)
    • Contains a letter and an integer frequency
    • Has accessor (getter) methods
  – public class BinaryHeap (given in class)
• Read a file ‘huff.txt’ which includes letters and frequencies:
  – A 20 E 24 G 3 H 4 I 17 L 6 N 5 O 10 S 8 V 1 W 2
• Create a Huffman tree (algorithm: book, pp. 389-395)
• Print the “legend”: the code of each character

Tips and Implementation Notes
• HuffmanNode should be Comparable to work with BinaryHeap
  – How to implement the compareTo method?
• Implement a toString method in BinaryHeap
  – Print the heap after every rearrangement
• Understand the binary heap operations:
  – insert
  – deleteMin
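
One possible shape for the pieces named above is sketched below in Java, under several assumptions. The node fields, the isLeaf, build, collectCodes and legend helpers, and the wrapper class HuffmanSketch are all hypothetical names. java.util.PriorityQueue (add/poll) stands in for the class BinaryHeap (insert/deleteMin), whose exact interface is not shown in these slides. The frequencies are typed in directly rather than read from ‘huff.txt’.

```java
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

class HuffmanSketch {
    // Hypothetical node: a leaf carries a letter, an internal node carries two children.
    static class HuffmanNode implements Comparable<HuffmanNode> {
        final char letter;              // meaningful only for leaves
        final int frequency;
        final HuffmanNode left, right;

        HuffmanNode(char letter, int frequency, HuffmanNode left, HuffmanNode right) {
            this.letter = letter;
            this.frequency = frequency;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() { return left == null && right == null; }

        // Order nodes by frequency so a min-heap hands back the lightest tree first.
        @Override
        public int compareTo(HuffmanNode other) {
            return Integer.compare(frequency, other.frequency);
        }
    }

    // Merge the two lightest trees until a single tree remains.
    static HuffmanNode build(char[] letters, int[] freqs) {
        PriorityQueue<HuffmanNode> heap = new PriorityQueue<>();  // stand-in for BinaryHeap
        for (int i = 0; i < letters.length; i++) {
            heap.add(new HuffmanNode(letters[i], freqs[i], null, null));
        }
        while (heap.size() > 1) {
            HuffmanNode a = heap.poll();
            HuffmanNode b = heap.poll();
            heap.add(new HuffmanNode('\0', a.frequency + b.frequency, a, b));
        }
        return heap.poll();
    }

    // Walk the tree: going left appends 0, going right appends 1.
    // (A one-letter alphabet would get the empty code; not handled specially here.)
    static void collectCodes(HuffmanNode node, String prefix, Map<Character, String> out) {
        if (node.isLeaf()) {
            out.put(node.letter, prefix);
            return;
        }
        collectCodes(node.left, prefix + "0", out);
        collectCodes(node.right, prefix + "1", out);
    }

    static Map<Character, String> legend(char[] letters, int[] freqs) {
        Map<Character, String> out = new TreeMap<>();
        collectCodes(build(letters, freqs), "", out);
        return out;
    }

    public static void main(String[] args) {
        char[] letters = {'A', 'E', 'G', 'H', 'I', 'L', 'N', 'O', 'S', 'V', 'W'};
        int[] freqs = {20, 24, 3, 4, 17, 6, 5, 10, 8, 1, 2};
        for (Map.Entry<Character, String> e : legend(letters, freqs).entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

Several weights in this input tie (for example 3 and 3 after the first merge), and the order in which a heap returns equal elements is not fixed, so the individual codes may differ between heap implementations. The total cost Σ_i d_i f_i should nevertheless come out the same.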