March 26, 2009: Lab 3: Huffman Coding

Lab 3: Huffman Coding

Overview:

The Huffman code is a fundamental encoding in the field of data compression. It forms the basis for many modern compression approaches, including the DEFLATE algorithm used to compress data in ZIP files. It is a variable-length code, meaning each character may have a different number of bits (rather than 8 for each, as in ASCII). The principle that allows the Huffman code to obtain very good compression rates is the representation of the most frequent characters with the smallest number of bits. That is, if the letter “E” occurs most in a corpus of text, “E” will be represented by a short string of bits, such as “1”. Conversely, infrequent letters, such as “X”, will be represented with longer codes, such as “001011”.

Although Huffman encoding saves space by using only as many bits as are required, the variable-length encoding does present a problem: if “1” represents “E” and “0” represents “S”, what happens when we need to represent “R” with “01”? How do we tell “R” apart from “SE”? The solution is to avoid using codes that could clash with each other, which means no code may be the beginning (prefix) of another code. For example, we would not represent “S” with “0”; we would represent it with “00”, which can be distinguished from “1” (the leading “0” clues us in) and from “01” (the second bit differs).

This encoding can be naturally modeled using a tree where moving left corresponds to a “0”, while moving right corresponds to a “1”. No codes can conflict because characters are stored only in the leaves, so the path to one character never passes through another character:

          (root)
         0/    \1
        ( )     E
      0/   \1
      S     R

Huffman’s discovery was to realize that this tree could be constructed by assigning each character a priority based on its frequency in the input, storing the characters and frequencies in leaves of the tree, and iteratively building the tree up by merging the two elements with the lowest frequency into one node. Elements with low frequencies are merged early and often, so they are placed at the bottom of the tree (long code lengths), while elements with higher frequencies are merged later and end up near the top (short code lengths).
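The merging procedure described above can be sketched with java.util.PriorityQueue. This is only an illustration under stated assumptions: the names HuffmanTreeSketch, Node, countFrequencies, buildTree and assignCodes are hypothetical and are not the classes provided for this lab. The exact bit patterns depend on which merged element is placed on the left, so they may differ from the codes used elsewhere in this handout even though the code lengths agree.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

class HuffmanTreeSketch {
    /** Hypothetical tree node; interior nodes carry no character. */
    static final class Node {
        final Character symbol; // null for interior nodes
        final int frequency;
        final Node left;
        final Node right;

        Node(Character symbol, int frequency, Node left, Node right) {
            this.symbol = symbol;
            this.frequency = frequency;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null && right == null;
        }
    }

    /** Counts how often each character occurs in the text. */
    static Map<Character, Integer> countFrequencies(String text) {
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : text.toCharArray()) {
            freq.merge(c, 1, Integer::sum);
        }
        return freq;
    }

    /** Repeatedly merges the two lowest-frequency nodes until one root remains. */
    static Node buildTree(Map<Character, Integer> freq) {
        PriorityQueue<Node> queue =
                new PriorityQueue<>(Comparator.comparingInt((Node n) -> n.frequency));
        for (Map.Entry<Character, Integer> e : freq.entrySet()) {
            queue.add(new Node(e.getKey(), e.getValue(), null, null));
        }
        while (queue.size() > 1) {
            Node first = queue.poll();  // lowest frequency; placed on the left (an assumption)
            Node second = queue.poll(); // next lowest; placed on the right
            queue.add(new Node(null, first.frequency + second.frequency, first, second));
        }
        return queue.poll(); // null if the frequency table was empty
    }

    /** Walks the tree: left appends "0", right appends "1"; leaves receive the path as their code. */
    static void assignCodes(Node node, String prefix, Map<Character, String> codes) {
        if (node == null) {
            return;
        }
        if (node.isLeaf()) {
            codes.put(node.symbol, prefix);
            return;
        }
        assignCodes(node.left, prefix + "0", codes);
        assignCodes(node.right, prefix + "1", codes);
    }

    public static void main(String[] args) {
        String text = "EEEEESSREE";
        Map<Character, String> codes = new HashMap<>();
        assignCodes(buildTree(countFrequencies(text)), "", codes);
        int bits = 0;
        for (char c : text.toCharArray()) {
            bits += codes.get(c).length();
        }
        System.out.println(codes);
        System.out.println("encoded length in bits: " + bits); // 13 for this input
    }
}
```

The sketch assumes at least two distinct characters in the input; with a single distinct character the lone leaf would receive an empty code, a case a full implementation would need to handle separately.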
Once the tree is built, the characters are replaced by their codes:

    E -> 1
    S -> 00
    R -> 01

The text “EEEEESSREE”, which consumes 80 bits in uncompressed ASCII, would be encoded as:

    1111100000111

This is only 13 bits long, less than 2 characters in ASCII. If the mappings are also stored, it is possible to decompress the code by recreating the tree from the mappings and traversing either left or right depending on the next bit in the code, then outputting the letter when a leaf is reached. The code we just produced, 1111100000111, would then translate back to “EEEEESSREE”, the original input string.

Lab Work:

I have given you classes that will generate frequency tables from the text, construct a Huffman tree from the frequency table, and Huffman-encode a string. Essentially, I have written the encoding algorithm for you. For this lab, I would like you to write the decoding algorithm, given the encoded string and the serialized mappings between characters and codes. This entails:

1. Deserializing the mappings. See the section below on Serialization.
2. Reconstructing the Huffman tree from the mappings. Remember, “0” means “left” and “1” means “right”; create nodes as you traverse. Also remember that only the values in the leaves matter; the interior nodes can have null values.
3. Reading the input string one character at a time and traversing the tree left or right based on the value of the character. (In a real Huffman code, it should be one bit at a time, but I’ve simplified your task by outputting ASCII “1”s and “0”s instead of packing bits.) You only output a character once you hit a leaf, after which you start again from the root.

The mappings between characters and codes are stored in a HashMap<Character, String>. I’ve provided Javadoc documentation for the classes I’ve written. Both the documentation and the source are available on the course website.
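As a rough illustration of steps 2 and 3, the sketch below rebuilds a tree from a code table and walks it to decode the example string from the overview. It is one possible shape, not the expected solution: HuffmanDecodeSketch, Node, rebuildTree and decode are hypothetical names, and the code table is typed in by hand rather than deserialized.

```java
import java.util.HashMap;
import java.util.Map;

class HuffmanDecodeSketch {
    /** Hypothetical tree node; only leaves hold a character. */
    static final class Node {
        Character symbol; // stays null on interior nodes
        Node left;
        Node right;
    }

    /** Follows each code from the root, creating nodes as needed, and stores the character at the end. */
    static Node rebuildTree(Map<Character, String> codes) {
        Node root = new Node();
        for (Map.Entry<Character, String> entry : codes.entrySet()) {
            Node current = root;
            for (char bit : entry.getValue().toCharArray()) {
                if (bit == '0') {
                    if (current.left == null) {
                        current.left = new Node();
                    }
                    current = current.left;
                } else {
                    if (current.right == null) {
                        current.right = new Node();
                    }
                    current = current.right;
                }
            }
            current.symbol = entry.getKey();
        }
        return root;
    }

    /** Reads ASCII '0'/'1' characters, emitting a character at each leaf and restarting at the root. */
    static String decode(String bits, Node root) {
        StringBuilder out = new StringBuilder();
        Node current = root;
        for (char bit : bits.toCharArray()) {
            if (bit == '0') {
                current = current.left;
            } else if (bit == '1') {
                current = current.right;
            } else {
                throw new IllegalArgumentException("unexpected character: " + bit);
            }
            if (current == null) {
                throw new IllegalArgumentException("input does not match the code table");
            }
            if (current.symbol != null) { // reached a leaf
                out.append(current.symbol);
                current = root;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Character, String> codes = new HashMap<>();
        codes.put('E', "1");
        codes.put('S', "00");
        codes.put('R', "01");
        Node root = rebuildTree(codes);
        System.out.println(decode("1111100000111", root)); // prints EEEEESSREE
    }
}
```

Because characters sit only in leaves, the decoder never needs to look ahead: reaching a node with a non-null symbol is enough to know a complete code has been read.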
Serialization:

Serialization is the process of converting an object into a sequence of bytes that can be stored in a file, transmitted over a socket, or otherwise sent around in a way that objects usually can’t be. Deserialization is the process of converting this representation back into an object. Any object whose class implements the Serializable interface (which defines no methods) can be serialized.

In Java, you may serialize an object by writing it to an ObjectOutputStream. ObjectOutputStream’s constructor takes another (byte-based) output stream, which is the underlying stream the object will be written to. For example, you may write a HashMap<Character, String> named targ out to disk as follows:

    FileOutputStream fout = new FileOutputStream("serializedobject.txt");
    ObjectOutputStream oout = new ObjectOutputStream(fout);
    oout.writeObject(targ);
    oout.close();

Or you may read it from disk as follows:

    FileInputStream fin = new FileInputStream("serializedobject.txt");
    ObjectInputStream oin = new ObjectInputStream(fin);
    // readObject() returns Object, so the result must be cast back to the map type
    HashMap<Character, String> targ = (HashMap<Character, String>) oin.readObject();
    oin.close();

These calls can throw IOException, and readObject() can also throw ClassNotFoundException, so the enclosing method must catch or declare both.
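For reference, here is a minimal self-contained round trip built on the same stream classes. It is a sketch under stated assumptions: SerializationSketch, save and load are hypothetical names, the map is filled by hand, and a temporary file stands in for the real mappings file.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;

class SerializationSketch {
    /** Writes the map to the given file; try-with-resources closes the stream even on failure. */
    static void save(HashMap<Character, String> codes, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(codes);
        }
    }

    /** Reads the map back; the cast is unchecked because generic types are erased at runtime. */
    @SuppressWarnings("unchecked")
    static HashMap<Character, String> load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (HashMap<Character, String>) in.readObject();
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException {
        HashMap<Character, String> codes = new HashMap<>();
        codes.put('E', "1");
        codes.put('S', "00");
        codes.put('R', "01");

        File file = File.createTempFile("codes", ".ser"); // temporary file for the demo
        save(codes, file);
        HashMap<Character, String> restored = load(file);
        System.out.println(restored.equals(codes)); // prints true
        file.delete();
    }
}
```

Both HashMap and its Character and String contents implement Serializable, which is why the map can be written without any extra work.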
