HW5 - UMass CS !EdLab

CmpSci 187
Spring 2011
HW5
Professor Lehnert
This assignment will use Huffman trees to compress plain text files using prefix
codes. Huffman trees are described in your text (pp. 579-581). Review that
algorithm before you continue.
Encoding Text
We will first extract a frequency table for the text we want to compress. Our
alphabet contains at most 26 lowercase characters, 26 uppercase characters, blanks
and some punctuation (we will restrict our alphabet to 64 characters for reasons
that will become clear later). Our first task will be to read all the characters in the
text file and produce a frequency table for all the characters in the text.
Once we have the frequency table, we will create a Huffman tree and associate a
variable-length binary string with each character in the text. The Huffman tree will be
saved to a file using object Serialization.
We will then use our Huffman codes to create one long binary string for the original
text file. To save this representation to a text file, we will chop the binary string up
into short substrings that can be easily converted back into ASCII characters. In this
way, we will see 1-4 characters from the input text file converted to a single
character in the output text file.
Decoding Text
Starting with individual characters in the compressed file, we first map each of those
characters to a numeric equivalent. Then each of those numbers will be translated
into its binary (base 2) representation and all of those binary strings will be
concatenated into one long binary string.
We then read in our serialized Huffman tree and use that tree to decode the binary
string.
Required Classes and Methods
• You must have a public HW5 class that contains the following methods:
(1) public Freqs createTable(String textfile) is used to generate a frequency
table for all the characters in the text file.
(2) public void createHuffman(Freqs table, String textfile) is used to create a
Huffman tree and save it to the file named textfile using Serialization.
(3) public void encode(String inputfile, String outputfile, String huff) reads
a Huffman tree from the file named huff, encodes the text in the inputfile
using that Huffman tree, and stores the encoded (compressed) text in the
outputfile.
(4) public void decode (String inputfile, String outputfile, String huff) reads
a Huffman tree from the file named huff, uses that tree to decode the text in
the inputfile, and saves the decoded (uncompressed) text in the outputfile.
NOTE: When we write the encoded text to the output file, we will convert the
Huffman encoding into printable characters by mapping bit sequences of length 6
into printable ASCII characters. This way we can see the actual compression in our
file conversions (see the example).
• You must have a public Freqs class that contains the following method:
(1) public int getCount (char ch) returns the frequency count for the character
ch. If the character is not found in the table, getCount returns 0.
A Freqs object stores all the characters in the alphabet along with their frequency
counts for a given text. Note that the alphabet for an input file is specific to that file –
there is no universal alphabet. To implement the table, you could create a character
ArrayList and an integer ArrayList containing the corresponding frequency counts
for each of the characters. The index of a character in the first list would be the
same index used to store its frequency count in the integer list.
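The parallel-list idea above can be sketched as follows. This is only one possible layout, not a required design: the class name FreqsSketch and the helpers add and fromText are assumptions made for illustration, and only getCount comes from the assignment.

```java
import java.util.ArrayList;

// Hypothetical sketch of a frequency table backed by two parallel lists.
class FreqsSketch {
    private final ArrayList<Character> chars = new ArrayList<>();
    private final ArrayList<Integer> counts = new ArrayList<>();

    // Record one occurrence of ch (the name "add" is an assumption).
    void add(char ch) {
        int i = chars.indexOf(ch);
        if (i < 0) {                 // first time we see this character
            chars.add(ch);
            counts.add(1);
        } else {                     // same index in both lists
            counts.set(i, counts.get(i) + 1);
        }
    }

    // Returns the frequency count for ch, or 0 if ch is not in the table.
    int getCount(char ch) {
        int i = chars.indexOf(ch);
        return i < 0 ? 0 : counts.get(i);
    }

    // Builds a table from text already held in memory (hypothetical helper).
    static FreqsSketch fromText(String text) {
        FreqsSketch table = new FreqsSketch();
        for (int k = 0; k < text.length(); k++) {
            table.add(text.charAt(k));
        }
        return table;
    }
}
```

For the text abcdbcdccddd, this table would report counts of 1, 2, 4 and 5 for a, b, c and d.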
• You must have a public Node<E> class that is used to represent binary trees.
Each Node object needs a left subtree, a right subtree, and an attribute of type E for
storing contents. We are giving you this class so you don’t need to write your own.
• You must have a public DictionaryEntry class that is used to represent node
content for your Huffman tree. Each DictionaryEntry object must contain the
character or characters associated with one Huffman tree node along with the
frequency count for that character set. We are giving you this class so you don’t need
to write your own.
An Example
Suppose the input file consists of a single line containing 12 characters:
abcdbcdccddd
This results in a unique* Huffman tree that produces the following character
encodings:
a  100
b  101
c  11
d  0
Reality Check: Make sure you and your code can both produce the Huffman tree that
represents these character codes.
* this tree is unique if it obeys the order property (see below)
The Huffman encoding for the original file will therefore be:
1001011101011101111000
which is actually longer than the original if we don’t convert this back to the ASCII
alphabet. So we will divide this binary string up into blocks of 6 bits which we will
then convert into corresponding integer values:
100101 110101 110111 1000
(but the last block has only 4 bits …)
The integer equivalents for these four blocks are:
37 53 55 8
In the ASCII character code, the printable characters start at 32, so we’ll add 32 to
each of these numbers to make sure we’re within the printable character range.
Note: We will restrict our input files to alphabets containing no more than 64
characters. This will keep our offset values in the range 32-95 which maps to the
printable ASCII characters. For a table of all the ASCII character codes, see:
http://www.physiology.wisc.edu/comp/docs/ascii/
After converting 37 53 55 8 to 69 85 87 40 and doing the ASCII character
conversion, our original input file of 12 characters is reduced to a 4-character
encoding:
EUW(
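The encoding steps in this example can be sketched in a few lines. This is an illustration under assumptions, not the required HW5 code: the class PackSketch and the methods toBits and pack are hypothetical names, the code table is passed in as a Map, and every character of the text is assumed to have a code in that map.

```java
import java.util.Map;

// Hypothetical sketch: text -> Huffman bit string -> printable characters.
class PackSketch {
    // Concatenate the code for each character of the text.
    static String toBits(String text, Map<Character, String> codes) {
        StringBuilder bits = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            bits.append(codes.get(text.charAt(i)));
        }
        return bits.toString();
    }

    // Binary string -> integer value -> printable character (value + 32).
    static char convertCode(String bits) {
        return (char) (Integer.parseInt(bits, 2) + 32);
    }

    // Cut the bit string into 6-bit blocks; the last block may be shorter.
    static String pack(String bits) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < bits.length(); i += 6) {
            int end = Math.min(i + 6, bits.length());
            out.append(convertCode(bits.substring(i, end)));
        }
        return out.toString();
    }
}
```

With the codes from the example, toBits("abcdbcdccddd", codes) yields the 22-bit string shown above, and pack turns that string into EUW(.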
To decode this back to the original input string, we just reverse all the steps:
EUW( → 69 85 87 40 → 37 53 55 8 → 100101 110101 110111 001000
We have to be careful on that last step. The base-10 to base-2 conversion will
normally produce binary strings of variable lengths. So in order to recover the
original blocks, we must pad any binary strings that have fewer than 6 characters with
leading 0’s. This generally gets us what we need, but the last block we’re trying to
recover may not be 6 bits long. In our example, the last block originally had only 4
bits. So how can we know the correct length for the last block? To get that last block
right, we should include the length of the last block in our encoding file. Let’s just
output the length of the last block on the first line of the compressed file like this:
4
EUW(
Now we can check that value to see that our Huffman encoding string is really:
100101 110101 110111 1000 → 1001011101011101111000 → abcdbcdccddd
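A minimal sketch of this recovery step, assuming the last-block length has already been read from the first line of the compressed file. The class UnpackSketch and the methods unpack and lastBlockLength are hypothetical names, and lastLen is assumed to be between 1 and 6.

```java
// Hypothetical sketch: printable characters -> original Huffman bit string.
class UnpackSketch {
    // Character -> 6-bit string, padded with leading zeros.
    static String convertInt(char ch) {
        String bits = Integer.toBinaryString(ch - 32);
        while (bits.length() < 6) bits = "0" + bits;
        return bits;
    }

    // Rebuild the bit string; only the final block is trimmed to lastLen bits.
    static String unpack(String packed, int lastLen) {
        StringBuilder bits = new StringBuilder();
        for (int i = 0; i < packed.length(); i++) {
            String block = convertInt(packed.charAt(i));
            if (i == packed.length() - 1) {
                block = block.substring(6 - lastLen); // drop the padding zeros
            }
            bits.append(block);
        }
        return bits.toString();
    }

    // What an encoder could write on the first line of the compressed file.
    static int lastBlockLength(String bits) {
        int r = bits.length() % 6;
        return r == 0 ? 6 : r;
    }
}
```

Here unpack("EUW(", 4) gives back the 22-bit string, because the final block 001000 is trimmed to 1000.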
To get from the final binary string to the original input we walk through our
Huffman tree as we traverse the binary string to recover the original text.
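The tree walk can be sketched as below. The nested TNode class is a hypothetical stand-in for the provided Node<E> and DictionaryEntry classes, whose exact fields and methods are not shown in this handout, and the example tree is built by hand from the codes above (0 means go left, 1 means go right).

```java
// Hypothetical sketch: decode a bit string by walking a Huffman tree.
class DecodeSketch {
    // Minimal node: leaves hold a character, internal nodes hold two children.
    static class TNode {
        final char ch;
        final TNode left, right;
        TNode(char ch) { this.ch = ch; this.left = null; this.right = null; }
        TNode(TNode left, TNode right) { this.ch = '\0'; this.left = left; this.right = right; }
        boolean isLeaf() { return left == null && right == null; }
    }

    // Follow one branch per bit; at a leaf, emit its character and restart at the root.
    static String decode(TNode root, String bits) {
        StringBuilder text = new StringBuilder();
        TNode cur = root;
        for (int i = 0; i < bits.length(); i++) {
            cur = (bits.charAt(i) == '0') ? cur.left : cur.right;
            if (cur.isLeaf()) {
                text.append(cur.ch);
                cur = root;
            }
        }
        return text.toString();
    }

    // The tree from the example: d = 0, c = 11, a = 100, b = 101.
    static TNode exampleTree() {
        TNode ab = new TNode(new TNode('a'), new TNode('b'));
        TNode abc = new TNode(ab, new TNode('c'));
        return new TNode(new TNode('d'), abc);
    }
}
```

Because no code is a prefix of another, the walk never needs to look ahead: each time it reaches a leaf, exactly one character has been recovered.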
Helpful Methods
The following methods will help you with ASCII character conversions. Use
convertCode when you are creating your compressed output file, and use convertInt
when you are decoding a compressed file.
// Converts a block of up to 6 bits (e.g. "100101") to a printable character.
char convertCode(String bits) {
    int power = 1;
    int total = 0;
    for (int i = bits.length() - 1; i >= 0; i--) {
        if (bits.charAt(i) == '1') total += power;
        power *= 2;
    }
    return (char) (total + 32); // adjust to printable range
}

// Converts a printable character back to its 6-bit block.
String convertInt(char ch) {
    int n = (int) ch; // n is between 32 and 95
    String result = "";
    n -= 32; // reverse printable range adjustment
    while (n > 0) {
        result = n % 2 + result;
        n = n / 2;
    }
    while (result.length() < 6) result = "0" + result;
    return result; // always returns a string of 6 bits
}
The Node and DictionaryEntry Classes
We are giving you two general classes that can be used to implement Huffman trees. The
Node class implements basic nodes for a generic binary tree, with a few extras.
In particular, the Node class contains two file I/O methods, writeTree and readTree, that
you can use to write Huffman trees to files and read them back. These instance methods use
object serialization, which is the easy way to save Java objects in files. Don’t worry about
understanding all the file I/O details in these methods. Just use these methods so you
don’t get bogged down in file I/O issues for HW5.
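For the curious, object serialization in general looks roughly like the sketch below. This is not the code of the provided writeTree and readTree methods: the class SerialSketch, the node type SNode, and the static helpers saveTree and loadTree are all hypothetical, and only the java.io classes are standard.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical sketch of saving and restoring a small tree with serialization.
class SerialSketch {
    // Every object reachable from the root must implement Serializable.
    static class SNode implements Serializable {
        private static final long serialVersionUID = 1L;
        String label;
        SNode left, right;
        SNode(String label, SNode left, SNode right) {
            this.label = label;
            this.left = left;
            this.right = right;
        }
    }

    // Writes the whole tree (root plus everything it references) to a file.
    static void saveTree(SNode root, String filename) throws IOException {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new FileOutputStream(filename))) {
            out.writeObject(root);
        }
    }

    // Reads the tree back; the cast is needed because readObject returns Object.
    static SNode loadTree(String filename) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                 new ObjectInputStream(new FileInputStream(filename))) {
            return (SNode) in.readObject();
        }
    }
}
```

Note that a serialized file is a binary format, so it will not be readable in a text editor.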
Do note the inclusion of toString methods in both the Node and DictionaryEntry classes.
These methods are called automatically by print or println commands whenever we pass
one of these objects as an argument to print or println.
Also study the recursive traverse method in the Node class. The traverse method
performs a depth-first tree traversal which can be helpful if you need to debug your
Huffman trees. Reality Check: You should be able to recreate the traverse method.
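As a point of comparison, a generic recursive depth-first traversal might look like the sketch below. It is not the provided traverse method, whose visiting order and output format may differ: the class TraverseSketch, the node type BNode, and the choice of preorder with indentation are assumptions for illustration.

```java
import java.util.List;

// Hypothetical sketch of a recursive depth-first (preorder) traversal.
class TraverseSketch {
    static class BNode<E> {
        E contents;
        BNode<E> left, right;
        BNode(E contents, BNode<E> left, BNode<E> right) {
            this.contents = contents;
            this.left = left;
            this.right = right;
        }
    }

    // Visit the node, then its left subtree, then its right subtree.
    // Each visited node is recorded as one line, indented by its depth.
    static <E> void traverse(BNode<E> node, int depth, List<String> out) {
        if (node == null) return;            // empty subtree: nothing to visit
        StringBuilder indent = new StringBuilder();
        for (int i = 0; i < depth; i++) indent.append("  ");
        out.add(indent + String.valueOf(node.contents));
        traverse(node.left, depth + 1, out);
        traverse(node.right, depth + 1, out);
    }
}
```

Collecting the lines in a list rather than printing them directly makes the result easy to inspect or compare while debugging.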
One Last Detail (the Order Property)
Huffman trees are generally not unique. Variations in how the nodes are merged
during the tree building process can create different Huffman codes for the same
alphabet and frequency distribution. In order to test your compressed files for
correctness, we must take steps to make sure we are working with unique Huffman
trees. To that end, we must place one key restriction on the construction of your
Huffman trees. All the non-terminal nodes in your Huffman tree must comply with
the order property. A Huffman tree node complies with the order property as long as
the frequency count associated with its left branch is always less than or equal to
the frequency count associated with its right branch. When you merge nodes during
tree construction, make sure you arrange your left and right child nodes
accordingly.
For example, note that there are many possible Huffman trees we could have
generated for the input data in our example above. But only one of those Huffman
trees complies with the order property.
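One way to build a tree that respects the order property is sketched below. It is a sketch under assumptions rather than the required solution: HuffNode stands in for the provided Node and DictionaryEntry classes, the frequency table is passed as a Map, and the convention that a left branch contributes 0 and a right branch 1 is inferred from the example codes. java.util.PriorityQueue makes no promise about the order of entries with equal counts, so inputs containing ties would need an explicit tie-breaking rule that this sketch does not supply.

```java
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical node: the character(s) under this subtree plus their total count.
class HuffNode {
    final String chars;
    final int count;
    final HuffNode left, right;
    HuffNode(String chars, int count, HuffNode left, HuffNode right) {
        this.chars = chars;
        this.count = count;
        this.left = left;
        this.right = right;
    }
    boolean isLeaf() { return left == null && right == null; }
}

class HuffmanSketch {
    // Repeatedly merge the two lowest-count nodes; the smaller one goes on the left.
    static HuffNode build(Map<Character, Integer> freqs) {
        PriorityQueue<HuffNode> queue = new PriorityQueue<HuffNode>(
            (x, y) -> Integer.compare(x.count, y.count));
        for (Map.Entry<Character, Integer> e : freqs.entrySet()) {
            queue.add(new HuffNode(String.valueOf(e.getKey()), e.getValue(), null, null));
        }
        while (queue.size() > 1) {
            HuffNode small = queue.poll(); // lowest count
            HuffNode large = queue.poll(); // next lowest: small.count <= large.count
            queue.add(new HuffNode(small.chars + large.chars,
                                   small.count + large.count, small, large));
        }
        return queue.poll(); // null if the table was empty
    }

    // Record the path to every leaf: 0 for a left branch, 1 for a right branch.
    static void collectCodes(HuffNode node, String prefix, Map<Character, String> codes) {
        if (node == null) return;
        if (node.isLeaf()) {
            codes.put(node.chars.charAt(0), prefix);
            return;
        }
        collectCodes(node.left, prefix + "0", codes);
        collectCodes(node.right, prefix + "1", codes);
    }
}
```

For the example counts (a = 1, b = 2, c = 4, d = 5) there are no ties at any merge step, and the resulting codes are a = 100, b = 101, c = 11 and d = 0.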
If your code produces Huffman trees that are not consistent with the order property,
you will lose points.