CmpSci 187 Spring 2011 HW5 Professor Lehnert

This assignment will use Huffman trees to compress plain text files using prefix codes. Huffman trees are described in your text (pp. 579-581). Review that algorithm before you continue.

Encoding Text

We will first extract a frequency table for the text we want to compress. Our alphabet contains at most 26 lowercase characters, 26 uppercase characters, blanks and some punctuation (we will restrict our alphabet to 64 characters for reasons that will become clear later). Our first task is to read every character in the text file and count how often each one occurs. Once we have the frequency table, we will create a Huffman tree and associate a variable-length binary string with each character in the text. The Huffman tree will be saved to a file using object serialization.

We will then use our Huffman codes to create one long binary string for the original text file. To save this representation to a text file, we will chop the binary string up into short substrings that can be easily converted back into ASCII characters. In this way, we will see 1-4 characters from the input text file converted to a single character in the output text file.

Decoding Text

Starting with individual characters in the compressed file, we first map each of those characters to a numeric equivalent. Then each of those numbers will be translated into its binary (base 2) representation and all of those binary strings will be concatenated into one long binary string. We then read in our serialized Huffman tree and use that tree to decode the binary string.

Required Classes and Methods

You must have a public HW5 class that contains the following methods:

(1) public Freqs createTable(String textfile) is used to generate a frequency table for all the characters in the text file.
(2) public void createHuffman(Freqs table, String textfile) is used to create a Huffman tree and save it to the file named textfile using serialization.
(3) public void encode(String inputfile, String outputfile, String huff) reads a Huffman tree from the file named huff, encodes the text in the inputfile using that Huffman tree, and stores the encoded (compressed) text in the outputfile.
(4) public void decode(String inputfile, String outputfile, String huff) reads a Huffman tree from the file named huff, uses that tree to decode the text in the inputfile, and saves the decoded (uncompressed) text in the outputfile.

NOTE: When we write the encoded text to the output file, we will convert the Huffman encoding into printable characters by mapping bit sequences of length 6 into printable ASCII characters. This way we can see the actual compression in our file conversions (see the example).

You must have a public Freqs class that contains the following method:

(1) public int getCount(char ch) returns the frequency count for the character ch. If the character is not found in the table, getCount returns 0.

A Freqs object stores all the characters in the alphabet along with their frequency counts for a given text. Note that the alphabet for an input file is specific to that file – there is no universal alphabet. To implement the table, you could create a character ArrayList and an integer ArrayList containing the corresponding frequency counts for each of the characters. The index of a character in the first list would be the same index used to store its frequency count in the integer list.

You must have a public Node<E> class that is used to represent binary trees. Each Node object needs a left subtree, a right subtree, and an attribute of type E for storing contents. We are giving you this class so you don’t need to write your own.
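The parallel-list layout suggested above for Freqs could be sketched roughly as follows. This is only an illustration: the class name FreqsSketch and the add helper are hypothetical, and only getCount comes from the required interface.

```java
import java.util.ArrayList;

// Hypothetical sketch of the parallel-list idea; not the required Freqs class.
class FreqsSketch {
    private final ArrayList<Character> chars = new ArrayList<>();
    private final ArrayList<Integer> counts = new ArrayList<>();

    // Assumed helper (not part of the assignment): record one occurrence of ch.
    void add(char ch) {
        int i = chars.indexOf(ch);
        if (i < 0) {
            // first time we see ch: append it with a count of 1
            chars.add(ch);
            counts.add(1);
        } else {
            // same index in both lists refers to the same character
            counts.set(i, counts.get(i) + 1);
        }
    }

    // Returns the frequency count for ch, or 0 if ch is not in the table.
    int getCount(char ch) {
        int i = chars.indexOf(ch);
        return (i < 0) ? 0 : counts.get(i);
    }

    public static void main(String[] args) {
        FreqsSketch table = new FreqsSketch();
        for (char ch : "abcdbcdccddd".toCharArray()) table.add(ch);
        System.out.println(table.getCount('d')); // prints 5
        System.out.println(table.getCount('z')); // prints 0
    }
}
```

A linear indexOf search is acceptable here because the alphabet is limited to 64 characters.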
You must have a public DictionaryEntry class that is used to represent node content for your Huffman tree. Each DictionaryEntry object must contain the character or characters associated with one Huffman tree node along with the frequency count for that character set. We are giving you this class so you don’t need to write your own.

An Example

Suppose the input file consists of a single line containing 12 characters:

abcdbcdccddd

This results in a unique* Huffman tree that produces the following character encodings:

a 100
b 101
c 11
d 0

* this tree is unique if it obeys the order property (see below)

Reality Check: Make sure you and your code can both produce the Huffman tree that represents these character codes.

The Huffman encoding for the original file will therefore be:

1001011101011101111000

which is actually longer than the original if we don’t convert this back to the ASCII alphabet. So we will divide this binary string up into blocks of 6 bits which we will then convert into corresponding integer values:

100101 110101 110111 1000 (but the last block has only 4 bits …)

The integer equivalents for these four blocks are

37 53 55 8

In the ASCII character code, the printable characters start at 32, so we’ll add 32 to each of these numbers to make sure we’re within the printable character range.

Note: We will restrict our input files to alphabets containing no more than 64 characters. This will keep our offset values in the range 32-95 which maps to the printable ASCII characters.
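The block-and-offset step described above can be sketched as follows. This is a minimal illustration, not the required implementation: the names BlockPacker and pack are hypothetical, and Integer.parseInt with radix 2 stands in for whatever bit-to-integer conversion you prefer.

```java
// Hypothetical helper; the class and method names are illustrative only.
class BlockPacker {
    // Split a bit string into blocks of up to 6 bits and map each block
    // to a printable character by adding the offset 32.
    static String pack(String bits) {
        StringBuilder out = new StringBuilder();
        for (int start = 0; start < bits.length(); start += 6) {
            // the last block may be shorter than 6 bits
            int end = Math.min(start + 6, bits.length());
            int value = Integer.parseInt(bits.substring(start, end), 2);
            out.append((char) (value + 32));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // blocks 100101 110101 110111 1000 -> 37 53 55 8 -> 69 85 87 40
        System.out.println(pack("1001011101011101111000")); // prints EUW(
    }
}
```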
For a table of all the ASCII character codes, see: http://www.physiology.wisc.edu/comp/docs/ascii/

After converting 37 53 55 8 to 69 85 87 40 and doing the ASCII character conversion, our original input file of 12 characters is reduced to a 4-character encoding:

EUW(

To decode this back to the original input string, we just reverse all the steps:

EUW(
69 85 87 40
37 53 55 8
100101 110101 110111 001000

We have to be careful on that last step. The base-10 to base-2 conversion will normally produce binary strings of variable lengths. So in order to recover the original blocks, we must pad any binary strings that have fewer than 6 characters with leading 0’s. This generally gets us what we need, but the last block we’re trying to recover may not be 6 bits long. In our example, the last block originally had only 4 bits. So how can we know the correct length for the last block?

To get that last block right, we should include the length of the last block in our encoding file. Let’s just output the length of the last block on the first line of the compressed file like this:

4
EUW(

Now we can check that value to see that our Huffman encoding string is really:

100101 110101 110111 1000
1001011101011101111000
abcdbcdccddd

To get from the final binary string to the original input, we walk through our Huffman tree as we traverse the binary string to recover the original text.

Helpful Methods

The following methods will help you with ASCII character conversions. Use convertCode when you are creating your compressed output file, and use convertInt when you are decoding a compressed file.
    char convertCode(String bits) {
        int power = 1;
        int total = 0;
        // add up the bit values, starting from the rightmost (lowest) bit
        for (int i = bits.length() - 1; i >= 0; i--) {
            if (bits.charAt(i) == '1') total += power;
            power *= 2;
        }
        return (char) (total + 32); // adjust to printable range
    }

    String convertInt(char ch) {
        int n = (int) ch; // n is between 32 and 95
        String result = "";
        n -= 32; // reverse printable range adjustment
        while (n > 0) {
            result = n % 2 + result;
            n = n / 2;
        }
        // pad with leading zeros up to 6 bits
        while (result.length() < 6) result = "0" + result;
        return result; // always returns a string of 6 bits
    }

The Node and DictionaryEntry Classes

We are giving you two general classes that can be used to implement Huffman trees. The Node class implements basic nodes for a generic binary tree, with a few extras. In particular, the Node class contains two file I/O methods, writeTree and readTree, that you can use to write Huffman trees to files and read them back. These instance methods use object serialization, which is the easy way to save Java objects in files. Don’t worry about understanding all the file I/O details in these methods. Just use these methods so you don’t get bogged down in file I/O issues for HW5.

Do note the inclusion of toString methods in both the Node and DictionaryEntry classes. These methods are called automatically by print or println commands whenever we pass one of these objects as an argument to print or println.

Also study the recursive traverse method in the Node class. The traverse method performs a depth-first tree traversal which can be helpful if you need to debug your Huffman trees.

Reality Check: You should be able to recreate the traverse method.

One Last Detail (the Order Property)

Huffman trees are generally not unique. Variations in how the nodes are merged during the tree building process can create different Huffman codes for the same alphabet and frequency distribution.
In order to test your compressed files for correctness, we must take steps to make sure we are working with unique Huffman trees. To that end, we must place one key restriction on the construction of your Huffman trees. All the non-terminal nodes in your Huffman tree must comply with the order property. A Huffman tree node complies with the order property as long as the frequency count associated with its left branch is always less than or equal to the frequency count associated with its right branch. When you merge nodes during tree construction, make sure you arrange your left and right child nodes accordingly.

For example, note that there are many possible Huffman trees we could have generated for the input data in our example above. But only one of those Huffman trees complies with the order property. If your code produces Huffman trees that are not consistent with the order property, you will lose points.
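One way to respect the order property during construction is sketched below. It is a rough illustration under several assumptions: HNode is a hypothetical stand-in for the provided Node and DictionaryEntry classes (which are not reproduced here), the left branch is labeled 0 and the right branch 1 (consistent with the codes in the example), the input has at least two distinct characters, and ties between equal counts are broken arbitrarily by list position.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the provided Node/DictionaryEntry classes.
class HNode {
    final String chars;      // character(s) covered by this subtree
    final int count;         // combined frequency count
    final HNode left, right; // both null for a leaf

    HNode(String chars, int count, HNode left, HNode right) {
        this.chars = chars; this.count = count;
        this.left = left; this.right = right;
    }
}

class OrderedHuffman {
    // Repeatedly merge the two lowest-count nodes, smaller count on the left.
    static HNode build(String symbols, int[] counts) {
        List<HNode> pool = new ArrayList<>();
        for (int i = 0; i < counts.length; i++)
            pool.add(new HNode(String.valueOf(symbols.charAt(i)), counts[i], null, null));
        while (pool.size() > 1) {
            HNode a = removeMin(pool); // smallest count
            HNode b = removeMin(pool); // next smallest, so a.count <= b.count
            pool.add(new HNode(a.chars + b.chars, a.count + b.count, a, b));
        }
        return pool.get(0);
    }

    // Remove and return the first node with the minimum count.
    private static HNode removeMin(List<HNode> pool) {
        int best = 0;
        for (int i = 1; i < pool.size(); i++)
            if (pool.get(i).count < pool.get(best).count) best = i;
        return pool.remove(best);
    }

    // Code for ch: append 0 for each left branch, 1 for each right branch.
    static String codeFor(HNode node, char ch) {
        String code = "";
        while (node.left != null) {
            if (node.left.chars.indexOf(ch) >= 0) { code += "0"; node = node.left; }
            else { code += "1"; node = node.right; }
        }
        return code;
    }

    // Walk the tree bit by bit; emit a character at each leaf and restart at the root.
    static String decode(HNode root, String bits) {
        StringBuilder text = new StringBuilder();
        HNode node = root;
        for (int i = 0; i < bits.length(); i++) {
            node = (bits.charAt(i) == '0') ? node.left : node.right;
            if (node.left == null) { text.append(node.chars); node = root; }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // counts for abcdbcdccddd: a=1, b=2, c=4, d=5
        HNode root = build("abcd", new int[] {1, 2, 4, 5});
        System.out.println(codeFor(root, 'a')); // prints 100
        System.out.println(codeFor(root, 'd')); // prints 0
        System.out.println(decode(root, "1001011101011101111000")); // prints abcdbcdccddd
    }
}
```

Because the smaller of the two merged nodes always becomes the left child, every non-terminal node built this way satisfies the order property. When two counts are equal, the property alone does not say which node goes left, so the tie-breaking rule used here is only one possible choice.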