March 26, 2009: Lab 3: Huffman Coding

Lab 3: Huffman Coding
Overview:
The Huffman code is a fundamental encoding in the field of data compression. It forms
the basis for many modern compression approaches, including the DEFLATE algorithm
used to compress data in ZIP files. It is a variable-length code, meaning each character
may be encoded with a different number of bits (rather than a fixed 8 bits each, as in
ASCII text stored one byte per character).
The principle that allows the Huffman code to obtain very good compression rates is the
representation of the most frequent characters with the smallest number of bits. That is, if
the letter “E” occurs most in a corpus of text, “E” will be represented by a short string of
bits, such as “1”. Conversely, infrequent letters, such as “X”, will be represented with
longer codes, such as “001011”.
Although Huffman encoding saves space by using only as many bits as are required, the
variable-length encoding does present problems: if “1” represents “E” and “0” represents
“S”, what happens when we need to represent “R” with “01”? How do we tell “R” apart
from “SE”?
The solution is to avoid using codes that could clash with each other: no code may be
the beginning (a prefix) of another code. For example, we would not represent “S” with
“0”; we would represent it with “00”, which leaves “01” free for “R” and can still be
distinguished from “1” (the leading “0” clues us in).
This encoding can be naturally modeled using a tree where moving left corresponds to a
“0”, while moving right corresponds to a “1”. No codes can conflict because characters
are stored only in the leaves, so the path to one character never passes through another:
(root)
    0: (interior node)
        00: S
        01: R
    1: E
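The "no clashing codes" rule can be checked mechanically. The following is a minimal sketch, not part of the provided lab classes; the class name PrefixCheckSketch and the method isPrefixFree are hypothetical names chosen for illustration. It tests whether any code in a mapping is a prefix of another.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not one of the lab's provided classes.
final class PrefixCheckSketch {

    // Returns true when no code is a prefix of (or equal to) another code.
    static boolean isPrefixFree(Map<Character, String> codes) {
        List<String> values = new ArrayList<>(codes.values());
        for (int i = 0; i < values.size(); i++) {
            for (int j = 0; j < values.size(); j++) {
                if (i != j && values.get(j).startsWith(values.get(i))) {
                    return false; // values.get(i) is a prefix of values.get(j)
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<Character, String> clashing = new HashMap<>();
        clashing.put('E', "1");
        clashing.put('S', "0");
        clashing.put('R', "01"); // "0" (S) is a prefix of "01" (R)

        Map<Character, String> fixed = new HashMap<>();
        fixed.put('E', "1");
        fixed.put('S', "00");
        fixed.put('R', "01");

        System.out.println(isPrefixFree(clashing)); // false
        System.out.println(isPrefixFree(fixed));    // true
    }
}
```

Any set of codes read off the leaves of a binary tree passes this check, which is why the tree model rules out ambiguity.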
Huffman’s discovery was to realize that this tree could be constructed by assigning each
character a priority based on its frequency in the input, storing the characters and
frequencies in leaves of the tree, and iteratively building the tree up by merging the two
elements with the lowest frequency into one node whose frequency is the sum of the two.
The merged node goes back into the pool, and the process repeats until only one node, the
root, remains. Elements with low frequencies are merged early and often, so they are
placed at the bottom of the tree (long code lengths), while elements with higher
frequencies are merged later, and end up near the top (short code lengths).
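The merging procedure can be sketched with a priority queue. This is only an illustration under stated assumptions: the names HuffmanBuildSketch, Node, buildTree, and collectCodes are hypothetical and are not the classes provided for the lab. Which child is labeled "0" or "1", and how ties are broken, is arbitrary, so the exact codes may differ from the example (for instance, "S" and "R" may be swapped) while the code lengths stay the same.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch; not the lab's provided tree-building classes.
final class HuffmanBuildSketch {

    static final class Node {
        final Character symbol; // null for interior nodes
        final int frequency;
        final Node left;
        final Node right;

        Node(Character symbol, int frequency, Node left, Node right) {
            this.symbol = symbol;
            this.frequency = frequency;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null && right == null;
        }
    }

    // Repeatedly merges the two lowest-frequency nodes until one root remains.
    static Node buildTree(Map<Character, Integer> frequencies) {
        PriorityQueue<Node> queue =
                new PriorityQueue<>((a, b) -> Integer.compare(a.frequency, b.frequency));
        for (Map.Entry<Character, Integer> e : frequencies.entrySet()) {
            queue.add(new Node(e.getKey(), e.getValue(), null, null));
        }
        while (queue.size() > 1) {
            Node first = queue.poll();
            Node second = queue.poll();
            queue.add(new Node(null, first.frequency + second.frequency, first, second));
        }
        return queue.poll(); // null if the frequency table was empty
    }

    // Walks the tree, appending "0" for left and "1" for right.
    static void collectCodes(Node node, String prefix, Map<Character, String> out) {
        if (node == null) {
            return;
        }
        if (node.isLeaf()) {
            out.put(node.symbol, prefix);
            return;
        }
        collectCodes(node.left, prefix + "0", out);
        collectCodes(node.right, prefix + "1", out);
    }

    public static void main(String[] args) {
        Map<Character, Integer> frequencies = new HashMap<>();
        frequencies.put('E', 7); // counts taken from "EEEEESSREE"
        frequencies.put('S', 2);
        frequencies.put('R', 1);

        Map<Character, String> codes = new HashMap<>();
        collectCodes(buildTree(frequencies), "", codes);
        System.out.println(codes);
    }
}
```

With these counts, "R" and "S" are merged first and "E" joins last, so "E" receives a one-bit code and the other two receive two-bit codes.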
Once the tree is built, the characters are replaced by their codes:
E -> 1
S -> 00
R -> 01
The text “EEEEESSREE”, which consumes 80 bits in uncompressed ASCII, would be
encoded as: 1111100000111, which is only 13 bits long – less than 2 characters in ASCII.
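The arithmetic above can be reproduced with a few lines of code. This is a minimal sketch rather than the encoder provided for the lab; the class HuffmanEncodeSketch and its encode method are hypothetical names. Like the lab, it emits the characters "1" and "0" instead of packed bits.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; the lab's provided encoder may be structured differently.
final class HuffmanEncodeSketch {

    // Replaces every character of the text with its code string.
    static String encode(String text, Map<Character, String> codes) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String code = codes.get(c);
            if (code == null) {
                throw new IllegalArgumentException("No code for character: " + c);
            }
            out.append(code);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Character, String> codes = new HashMap<>();
        codes.put('E', "1");
        codes.put('S', "00");
        codes.put('R', "01");

        String encoded = encode("EEEEESSREE", codes);
        // Prints: 1111100000111 (13 bits)
        System.out.println(encoded + " (" + encoded.length() + " bits)");
    }
}
```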
If the mappings are also stored, it is possible to decompress the code by recreating the
tree from the mappings, traversing either left or right depending on the next bit in the
code, and outputting the letter (then returning to the root) whenever a leaf is reached.
The code we just produced:
1111100000111
would then translate back to “EEEEESSREE”, the original input string.
Lab Work:
I have given you classes that will generate frequency tables from the text, construct a
Huffman tree from the frequency table, and Huffman-encode a string. Essentially, I have
written the encoding algorithm for you. For this lab, I would like you to write the
decoding algorithm, given the encoded string and the serialized mappings between
characters and codes. This entails:
1. Deserializing the mappings. See the section below on Serialization.
2. Reconstructing the Huffman tree from the mappings (remember, “0” means “left”
and “1” means “right”; create nodes as you traverse. Also remember that only the
values in the leaves matter – the interior nodes can have null values).
3. Reading the input string one character at a time (in a real Huffman code, it should
be one bit at a time, but I’ve simplified your task by outputting ASCII “1”s and
“0”s instead of packing bits) and traversing the tree left or right based on the
value of the character. You only output a character once you hit a leaf, after
which you start again from the root.
The mappings between characters and codes are stored in a HashMap<Character, String>.
I’ve provided Javadoc documentation for the classes I’ve written. Both the
documentation and the source are available on the course website.
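Steps 2 and 3 above can be sketched as follows. This is one possible shape under stated assumptions, not a reference solution: the class HuffmanDecodeSketch, its Node type, and the methods rebuildTree and decode are hypothetical names, and the node class supplied with the lab may look different. Deserializing the map (step 1) is left out here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; the lab's own node and tree classes may differ.
final class HuffmanDecodeSketch {

    static final class Node {
        Character value; // stays null for interior nodes
        Node left;
        Node right;

        boolean isLeaf() {
            return left == null && right == null;
        }
    }

    // Step 2: follow each code from the root, creating nodes as needed.
    static Node rebuildTree(Map<Character, String> codes) {
        Node root = new Node();
        for (Map.Entry<Character, String> entry : codes.entrySet()) {
            Node current = root;
            for (char bit : entry.getValue().toCharArray()) {
                if (bit == '0') {
                    if (current.left == null) current.left = new Node();
                    current = current.left;
                } else if (bit == '1') {
                    if (current.right == null) current.right = new Node();
                    current = current.right;
                } else {
                    throw new IllegalArgumentException("Code contains a non-bit: " + bit);
                }
            }
            current.value = entry.getKey(); // only leaves carry a value
        }
        return root;
    }

    // Step 3: walk left on '0', right on '1'; emit at a leaf and restart at the root.
    static String decode(String encoded, Node root) {
        StringBuilder out = new StringBuilder();
        Node current = root;
        for (int i = 0; i < encoded.length(); i++) {
            char bit = encoded.charAt(i);
            if (bit == '0') {
                current = current.left;
            } else if (bit == '1') {
                current = current.right;
            } else {
                throw new IllegalArgumentException("Input contains a non-bit: " + bit);
            }
            if (current == null) {
                throw new IllegalArgumentException("No code matches at position " + i);
            }
            if (current.isLeaf()) {
                out.append(current.value);
                current = root;
            }
        }
        if (current != root) {
            throw new IllegalArgumentException("Input ends in the middle of a code");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Character, String> codes = new HashMap<>();
        codes.put('E', "1");
        codes.put('S', "00");
        codes.put('R', "01");
        // Prints: EEEEESSREE
        System.out.println(decode("1111100000111", rebuildTree(codes)));
    }
}
```

The explicit checks for a missing child and for input that stops partway through a code are a design choice of this sketch; they make malformed input fail loudly instead of producing a NullPointerException or silently dropping trailing bits.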
Serialization:
Serialization is the process of converting an object into a sequence of bytes that can be
stored in a file, transmitted over a socket, or otherwise sent around in a way that objects
usually can’t be. Deserialization is the process of converting this representation back into
an object. Any object that implements the Serializable interface (which defines no methods)
can be serialized.
In Java, you may serialize an object by writing it to an ObjectOutputStream.
ObjectOutputStream’s constructor takes another (byte-based) output stream, which is the
underlying stream the object will be written to. For example, you may write an object
called “HashMap<Character, String> targ” out to disk as follows:
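// Requires: import java.io.FileOutputStream; import java.io.ObjectOutputStream;
FileOutputStream fout = new FileOutputStream("serializedobject.txt");
ObjectOutputStream oout = new ObjectOutputStream(fout);
oout.writeObject(targ);   // writes the whole map, keys and values included
oout.close();             // also closes the underlying file stream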
Or you may read it from disk as follows:
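// Requires: import java.io.FileInputStream; import java.io.ObjectInputStream;
FileInputStream fin = new FileInputStream("serializedobject.txt");
ObjectInputStream oin = new ObjectInputStream(fin);
// readObject() returns Object, so an (unchecked) cast is needed.
HashMap<Character, String> targ = (HashMap<Character, String>) oin.readObject();
oin.close();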
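The two fragments can be combined into a self-contained round trip. This is a minimal sketch, assuming a temporary file is acceptable for demonstration; the class SerializationSketch and the methods save and load are hypothetical names, not part of the provided classes. It uses try-with-resources so the streams are closed even if an exception is thrown.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;

// Hypothetical sketch; file name and helper names are chosen for illustration.
final class SerializationSketch {

    // Serializes the map to the given file.
    static void save(HashMap<Character, String> codes, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(codes);
        }
    }

    // Reads the map back; the cast is unchecked because readObject() returns Object.
    @SuppressWarnings("unchecked")
    static HashMap<Character, String> load(Path file)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (HashMap<Character, String>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        HashMap<Character, String> codes = new HashMap<>();
        codes.put('E', "1");
        codes.put('S', "00");
        codes.put('R', "01");

        Path file = Files.createTempFile("codes", ".ser");
        try {
            save(codes, file);
            HashMap<Character, String> restored = load(file);
            System.out.println(restored.equals(codes)); // true
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Note that readObject() can throw ClassNotFoundException in addition to IOException, so any method calling it must declare or handle both.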