Huffman Codes
Computing an Optimal Code for a Document

Objectives
You will be able to:
  - Create an optimal code for an ASCII text file.
  - Encode the text file using the optimal code and output the compressed text as a binary file.
  - Read the compressed binary file and reconstruct the original ASCII text.
  - Output the decoded message to a text file.
  - Encode and decode a large text file (Moby Dick).

Getting Started
  - Download the program from the last class:
    http://www.cse.usf.edu/~turnerr/Data_Structures/Downloads/2011_04_13_Huffman_Codes_with_Binary_IO/
    File: Huffman_Codes_with_Binary_IO.zip
  - A bit of cleanup:
      - Improve the prompts as shown on the following slides.
      - Delete commented-out sections in main.cpp.
      - Remove the output of the sorted list in Make_Decode_Tree.

Modifications to Prompts (main.cpp)
In do_decode (line 29):

    //cout << "File name for input? ";
    cout << "File name for compressed input file? ";

In do_encode (line 89):

    //cout << "File name for output? ";
    cout << "File name for compressed output file? ";

An Error on Circe
Binary_File.h, line 14 should be:

    static const size_t FIRST_BIT_POSITION = 8*sizeof(size_t);

  - int and size_t are the same size on 32-bit Windows systems.
  - They are not the same size on Circe, and probably not on other 64-bit systems.
  - Other errors and warnings on Circe have fairly obvious fixes.

Program Running
[Screenshot]

Text Files for Testing
Download to a convenient directory:
  - Full text of Moby Dick:
    http://www.cse.usf.edu/~turnerr/Data_Structures/Downloads/Moby_Dick.txt
  - Abridged version:
    http://www.cse.usf.edu/~turnerr/Data_Structures/Downloads/Moby_Quick.txt

Moby Dick (Abridged)
[Screenshot]

Get Input from a File
Modify the Huffman Code program to get its input for encode from a text file rather than from the keyboard.

main.cpp
Insert above do_encode:

    void get_text_input_file(string& input_filename, ifstream& infile)
    {
        string junk;
        while (true) {
            cout << "File name for text input? ";
            cin >> input_filename;
            getline(cin, junk);     // Skip newline char
            infile.open(input_filename.c_str());
            if (infile.good()) {
                break;
            }
            infile.clear();         // Reset error state before retrying
            cout << "Open failed for file " << input_filename << endl;
            cout << "Please try again\n";
        }
    }

http://www.cse.usf.edu/~turnerr/Data_Structures/Downloads/2011_04_18_Huffman_Code_for_Document/get_text_input_file.cpp.txt

do_encode()
Revised version that gets input from a file rather than from the keyboard:
http://www.cse.usf.edu/~turnerr/Data_Structures/Downloads/2011_04_18_Huffman_Code_for_Document/do_encode.cpp.txt

    void do_encode(void)
    {
        string msg;
        string output_filename;
        Binary_Output_File* outfile;
        string junk;
        string input_filename;
        ifstream infile;

        get_text_input_file(input_filename, infile);

        while (true) {
            cout << "\nFile name for compressed output file? ";
            cin >> output_filename;
            getline(cin, junk);     // Skip newline char
            try {
                outfile = new Binary_Output_File(output_filename);
                break;
            }
            catch (const string& msg) {
                cout << msg << endl;
            }
        }

do_encode() (continued)

        //cout << "\n\nEnter message to encode\n";
        //getline(cin, msg);
        while (infile.good()) {
            char next_char;
            if (!infile.get(next_char)) break;   // Stop at end of file
            string code = huffman_tree.Encode_Char(tolower(next_char));
            if (code.size() == 0) {
                // Character is not in the code; report it and skip it.
                cout << endl << "Invalid character in input " << next_char << endl;
                continue;
            }
            outfile->Output(code);
        }
        infile.close();
        cout << endl << endl;
        outfile->Close();
        delete(outfile);
        cout << "File " << output_filename << " written\n";
    }

Program in Action
[Screenshot]

Program continuing
[Screenshot]

Some Issues
  - White space: newline characters are lost
  - Punctuation
  - Capitalization
Let's build a code specifically for this document.
  - Include all characters.
  - Optimize weights for the document.

Developing a Code for the Document
New version of build_huffman_tree:
  - Read the input text file and count occurrences of each character.
      - Also count the total number of characters in the file.
  - For each ASCII value that appears in the input text file:
      - Compute relative frequency.
      - Add char and frequency to the Huffman tree.

New build_huffman_tree()
http://www.cse.usf.edu/~turnerr/Data_Structures/Downloads/2011_04_18_Huffman_Code_for_Document/build_huffman_tree.cpp.txt

    void build_huffman_tree(ifstream& infile)
    {
        int counts[128] = {0};
        int total = 0;

        // Count characters in the input file.
        while (infile.good()) {
            char next_char;
            if (!infile.get(next_char)) break;   // Stop at end of file
            assert (next_char > 0);
            assert (next_char <= 127);
            ++counts[next_char];
            ++total;
        }
        infile.close();
        infile.clear();

New build_huffman_tree() (continued)

        // Add each character that occurs, weighted by its relative frequency.
        for (int i = 0; i < 128; ++i) {
            if (counts[i] > 0) {
                huffman_tree.Add(i, (1.0*counts[i]) / total);
            }
        }
    }

main.cpp
Add at top:

    #include <cassert>
    …
    string input_filename;
    ifstream infile;

Add to main():

    int main(void)
    {
        cout << "This is the Huffman code program \n";
        get_text_input_file(input_filename, infile);
        build_huffman_tree(infile);

do_encode()
  - We have to reopen the input file after reading it in build_huffman_tree.
  - No longer call get_text_input_file.
      - Comment out the call to get_text_input_file near the top.
At line 104:

    //cout << "\n\nEnter message to encode\n";
    //getline(cin, msg);
    infile.open(input_filename.c_str());
    while (infile.good()) {

do_encode()
At line 112, remove the call to tolower() from this statement:

    infile.open(input_filename.c_str());
    while (infile.good()) {
        char next_char;
        if (!infile.get(next_char)) break;   // Stop at end of file
        string code = huffman_tree.Encode_Char(tolower(next_char));

We can now encode all characters.

Program Running
[Screenshot]

So far, so good!
  - The program seems to be working for a short file.
  - Let's try it on the full text.
  - You may not want to wait for the complete output!

Output Decoded Message to a File
Add above do_decode():
http://www.cse.usf.edu/~turnerr/Data_Structures/Downloads/2011_04_18_Huffman_Code_for_Document/get_text_output_file.cpp.txt

    void get_text_output_file(string& output_filename, ofstream& outfile)
    {
        string junk;
        while (true) {
            cout << "File name for text output? ";
            cin >> output_filename;
            getline(cin, junk);     // Skip newline char
            outfile.open(output_filename.c_str());
            if (outfile.good()) {
                break;
            }
            outfile.clear();        // Reset error state before retrying
            cout << "Open failed for file " << output_filename << endl;
            cout << "Please try again\n";
        }
    }

Output Decoded Message to a File
At end of do_decode:

        original_message = huffman_tree.Decode_Msg(coded_message);
        //cout << "Original message: " << original_message << endl;
        //cout << endl << endl;
        string output_filename;
        ofstream outfile;
        get_text_output_file(output_filename, outfile);
        outfile << original_message;
        outfile.close();
        cout << "File " << output_filename << " written";
        cout << endl << endl;
    }

Test on Full Text of Moby Dick
[Screenshots]

On Circe (After some tweaking)
[Screenshot]

Embedding the Code
  - In order for the compressed file to be useful, we have to store the code along with it.
  - Then we can read and decode the file at a later time.
      - Even on a different computer (with the same architecture).
  - In order to decode:
      - First read the code.
      - Reconstitute the decode tree.
      - Then read and decode the message.
  - Project 7