Huffman Code for Document

Huffman Codes
Computing an Optimal Code for a Document
Objectives
You will be able to:

Create an optimal code for an ASCII text file.
Encode the text file using the optimal code and output the compressed text as a binary file.
Read the compressed binary file and reconstruct the original ASCII text.
    Output the decoded message to a text file.
Encode and decode a large text file.
    Moby Dick
Getting Started


Download program from last class.
    http://www.cse.usf.edu/~turnerr/Data_Structures/Downloads/2011_04_13_Huffman_Codes_with_Binary_IO/
    File Huffman_Codes_with_Binary_IO.zip

A bit of cleanup
    Improve the prompts as shown on the following slides.
    Delete commented-out sections in main.cpp.
    Remove output of sorted list in Make_Decode_Tree.
Modifications to Prompts
main.cpp
In do_decode (line 29):

    //cout << "File name for input? ";
    cout << "File name for compressed input file? ";

In do_encode (line 89):

    //cout << "File name for output? ";
    cout << "File name for compressed output file? ";
An Error on Circe

Binary_File.h, line 14 should be:

    static const size_t FIRST_BIT_POSITION = 8*sizeof(size_t);

int and size_t are the same size on 32-bit Windows systems.
    Not on Circe.
    Probably not on other 64-bit systems.
Other errors and warnings on Circe have fairly obvious fixes.
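To see why the constant should be derived from size_t rather than int, the widths of the two types can be printed on the machine in question. This is a minimal sketch assuming a C++11 compiler; the names bits_in and report_widths are hypothetical and are not part of Binary_File.h.

```cpp
#include <climits>
#include <cstddef>
#include <iostream>

// Hypothetical helper (not part of Binary_File.h): number of bits in type T.
template <typename T>
constexpr std::size_t bits_in()
{
    return CHAR_BIT * sizeof(T);
}

// Print the two widths that matter for FIRST_BIT_POSITION.
void report_widths()
{
    std::cout << "int:    " << bits_in<int>() << " bits\n";
    std::cout << "size_t: " << bits_in<std::size_t>() << " bits\n";
}
```

On a typical 64-bit Linux system the two values differ (int is usually 32 bits and size_t 64 bits), which is the situation described for Circe.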
Program Running
Text Files for Testing

Download to a convenient directory:

Full text of Moby Dick
    http://www.cse.usf.edu/~turnerr/Data_Structures/Downloads/Moby_Dick.txt
Abridged version
    http://www.cse.usf.edu/~turnerr/Data_Structures/Downloads/Moby_Quick.txt
Moby Dick (Abridged)
Get Input from a File

Modify the Huffman Code program to get its input for encode from a text file rather than from the keyboard.
main.cpp

Insert above do_encode:
http://www.cse.usf.edu/~turnerr/Data_Structures/Downloads/2011_04_18_Huffman_Code_for_Document/get_text_input_file.cpp.txt

void get_text_input_file(string& input_filename, ifstream& infile)
{
    string junk;
    while (true)
    {
        cout << "File name for text input? ";
        cin >> input_filename;
        getline(cin, junk);     // Skip newline char
        infile.open(input_filename.c_str());
        if (infile.good())
        {
            break;
        }
        infile.clear();         // Reset error state before retrying
        cout << "Open failed for file " << input_filename << endl;
        cout << "Please try again\n";
    }
}
do_encode()
Revised version that gets input from a file rather than from the keyboard:
http://www.cse.usf.edu/~turnerr/Data_Structures/Downloads/2011_04_18_Huffman_Code_for_Document/do_encode.cpp.txt
void do_encode(void)
{
    string msg;
    string output_filename;
    Binary_Output_File* outfile;
    string junk;
    string input_filename;
    ifstream infile;

    get_text_input_file(input_filename, infile);

    while (true)
    {
        cout << "\nFile name for compressed output file? ";
        cin >> output_filename;
        getline(cin, junk);     // Skip newline char
        try
        {
            outfile = new Binary_Output_File(output_filename);
            break;
        }
        catch (const string& msg)
        {
            cout << msg << endl;
        }
    }

    //cout << "\n\nEnter message to encode\n";
    //getline(cin, msg);

    // Test the result of get() itself so that no character is
    // processed after the read fails at end of file.
    char next_char;
    while (infile.get(next_char))
    {
        string code = huffman_tree.Encode_Char(tolower(next_char));
        if (code.size() == 0)
        {
            cout << endl << "Invalid character in input "
                 << next_char << endl;
            continue;
        }
        outfile->Output(code);
    }
    infile.close();
    cout << endl << endl;
    outfile->Close();
    delete outfile;
    cout << "File " << output_filename << " written\n";
}
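A note on the read loop: how the loop condition is written decides whether a character is processed after the final read fails. The sketch below shows the idiom of testing the result of get() directly. The function name count_chars_read is hypothetical, and a std::istringstream can stand in for the input file when trying it out.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: count the characters actually delivered by the stream.
// get() returns the stream, which converts to false once a read fails,
// so the loop body runs exactly once per character read.
std::size_t count_chars_read(std::istream& in)
{
    std::size_t n = 0;
    char next_char;
    while (in.get(next_char))
    {
        ++n;
    }
    return n;
}
```

By contrast, a loop that checks good() first and calls get() inside the body runs one extra time at end of file, with next_char left holding whatever it held before.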
Program in Action
Program continuing
Some Issues

White space
    newline characters lost
Punctuation
Capitalization

Let's build a code specifically for this document.
    Include all characters.
    Optimize weights for the document.
Developing a Code for the Document


New version of build_huffman_tree
Read the input text file and count occurrences of each character.
    Also count the total number of characters in the file.
For each ASCII value that appears in the input text file:
    Compute relative frequency.
    Add char and frequency to the Huffman tree.
New build_huffman_tree()
http://www.cse.usf.edu/~turnerr/Data_Structures/Downloads/2011_04_18_Huffman_Code_for_Document/build_huffman_tree.cpp.txt

void build_huffman_tree(ifstream& infile)
{
    int counts[128] = {0};
    int total = 0;

    // Count characters in the input file.
    // Testing get() directly avoids counting a character after
    // the read fails at end of file.
    char next_char;
    while (infile.get(next_char))
    {
        assert (next_char > 0);
        assert (next_char <= 127);
        ++counts[next_char];
        ++total;
    }
    infile.close();
    infile.clear();

    // Add each character that occurs, weighted by its relative frequency.
    for (int i = 0; i < 128; ++i)
    {
        if (counts[i] > 0)
        {
            huffman_tree.Add(i, (1.0*counts[i]) / total);
        }
    }
}
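The course's Huffman tree class is not shown in these slides, so as an illustration of how the frequencies turn into code lengths, here is a minimal standalone sketch. It assumes C++11 and uses std::priority_queue; the names Node, HeavierFirst, collect_lengths, and huffman_code_lengths are hypothetical and are not the class used by the program.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <queue>
#include <vector>

// Hypothetical node type for illustration; not the course's tree class.
struct Node
{
    double weight;
    int symbol;                          // -1 for internal nodes
    std::shared_ptr<Node> left, right;
};

struct HeavierFirst
{
    bool operator()(const std::shared_ptr<Node>& a,
                    const std::shared_ptr<Node>& b) const
    {
        return a->weight > b->weight;    // makes priority_queue a min-heap
    }
};

// Record the depth of every leaf; the depth is the code length in bits.
static void collect_lengths(const std::shared_ptr<Node>& n, int depth,
                            std::map<int, int>& lengths)
{
    if (!n->left && !n->right)
    {
        lengths[n->symbol] = (depth == 0) ? 1 : depth;  // lone-symbol case
        return;
    }
    collect_lengths(n->left, depth + 1, lengths);
    collect_lengths(n->right, depth + 1, lengths);
}

// Repeatedly merge the two lightest nodes until a single tree remains.
std::map<int, int> huffman_code_lengths(const std::map<int, double>& freq)
{
    assert(!freq.empty());
    std::priority_queue<std::shared_ptr<Node>,
                        std::vector<std::shared_ptr<Node>>,
                        HeavierFirst> pq;
    for (const auto& kv : freq)
    {
        pq.push(std::make_shared<Node>(Node{kv.second, kv.first, nullptr, nullptr}));
    }
    while (pq.size() > 1)
    {
        auto a = pq.top(); pq.pop();
        auto b = pq.top(); pq.pop();
        pq.push(std::make_shared<Node>(Node{a->weight + b->weight, -1, a, b}));
    }
    std::map<int, int> lengths;
    collect_lengths(pq.top(), 0, lengths);
    return lengths;
}
```

With frequencies 0.5, 0.25, and 0.25 for three characters, the most frequent character gets a 1-bit code and the other two get 2-bit codes.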
main.cpp

Add at top:

    #include <cassert>
    …
    string input_filename;
    ifstream infile;

Add to main():

    int main(void)
    {
        cout << "This is the Huffman code program \n";
        get_text_input_file(input_filename, infile);
        build_huffman_tree(infile);
do_encode()

We have to reopen the input file after reading it in build_huffman_tree.
    No longer call get_text_input_file.
    Comment out the call to get_text_input_file near the top.

At line 104:

    //cout << "\n\nEnter message to encode\n";
    //getline(cin, msg);
    infile.open(input_filename.c_str());
    char next_char;
    while (infile.get(next_char))
    {

At line 112, remove the call to tolower():

    infile.open(input_filename.c_str());
    char next_char;
    while (infile.get(next_char))
    {
        // Was: huffman_tree.Encode_Char(tolower(next_char));
        string code = huffman_tree.Encode_Char(next_char);

We can now encode all characters.
Program Running
So far, so good!


The program seems to be working for a short file.
Let's try it on the full text.
    You may not want to wait for the complete output!
Output Decoded Message to a File
Add above do_decode():
http://www.cse.usf.edu/~turnerr/Data_Structures/Downloads/2011_04_18_Huffman_Code_for_Document/get_text_output_file.cpp.txt

void get_text_output_file(string& output_filename, ofstream& outfile)
{
    string junk;
    while (true)
    {
        cout << "File name for text output? ";
        cin >> output_filename;
        getline(cin, junk);     // Skip newline char
        outfile.open(output_filename.c_str());
        if (outfile.good())
        {
            break;
        }
        outfile.clear();        // Reset error state before retrying
        cout << "Open failed for file " << output_filename << endl;
        cout << "Please try again\n";
    }
}
At end of do_decode:

    original_message = huffman_tree.Decode_Msg(coded_message);
    //cout << "Original message: " << original_message << endl;
    //cout << endl << endl;
    string output_filename;
    ofstream outfile;
    get_text_output_file(output_filename, outfile);
    outfile << original_message;
    outfile.close();
    cout << "File " << output_filename << " written";
    cout << endl << endl;
}
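Once the decoded text is in a file, one way to confirm that it matches the original is to compare the two byte for byte. This is a sketch only and is not part of the lecture's program; the function name same_contents is hypothetical. It takes any two input streams, so it works with ifstream objects opened on the original and decoded files.

```cpp
#include <cassert>
#include <istream>
#include <iterator>
#include <sstream>

// Hypothetical helper: true if both streams deliver exactly the same bytes.
bool same_contents(std::istream& a, std::istream& b)
{
    std::istreambuf_iterator<char> a_it(a), b_it(b), end;
    while (a_it != end && b_it != end)
    {
        if (*a_it != *b_it)
        {
            return false;       // first differing byte
        }
        ++a_it;
        ++b_it;
    }
    // Equal only if both streams ran out at the same time.
    return a_it == end && b_it == end;
}
```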
Test on Full Text of Moby Dick
On Circe
(After some tweaking)
Embedding the Code

In order for the compressed file to be useful, we have to store the code along with it.
    Then we can read and decode the file at a later time.
    Even on a different computer (with the same architecture).

In order to decode:
    First read the code.
    Reconstitute the decode tree.
    Then read and decode the message.
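One possible way to store the code, sketched under assumptions: write the 128 character counts ahead of the compressed bits, so that the decoder can recompute the same frequencies and rebuild the same tree. The header layout and the names ALPHABET_SIZE, write_counts_header, and read_counts_header are hypothetical; the format required for the project may well differ.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>

const int ALPHABET_SIZE = 128;   // ASCII, matching the counts array size

// Assumed header layout: 128 counts, each a 32-bit unsigned integer,
// written in the host's byte order (so not portable across architectures).
void write_counts_header(std::ostream& out,
                         const std::uint32_t counts[ALPHABET_SIZE])
{
    out.write(reinterpret_cast<const char*>(counts),
              ALPHABET_SIZE * sizeof(std::uint32_t));
}

// Read the header back; returns false if the stream is too short.
bool read_counts_header(std::istream& in,
                        std::uint32_t counts[ALPHABET_SIZE])
{
    in.read(reinterpret_cast<char*>(counts),
            ALPHABET_SIZE * sizeof(std::uint32_t));
    return static_cast<bool>(in);
}
```

Writing raw integers in host byte order is what ties the file to machines of the same architecture. Storing counts rather than computed frequencies lets the decoder repeat exactly the same division the encoder performed.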
Project 7