Squishin’ Stuff: Huffman Compression

Data Compression

• Begin with a computer file (text, picture, movie, sound, executable, etc.).
• Most files contain extra information or redundancy.
• Goal: reorganize the file to remove the excess information and redundancy.
• Lossless compression: compress the file in such a way that none of the information is lost (good for text files and executables).
• Lossy compression: allow some information to be thrown away in order to get a better level of compression (good for pictures, movies, or sounds).
• There are many, many, many algorithms out there to compress files.
• Different types of files work best with different algorithms (you need to consider the structure of the file and how things are connected).
• We’re going to focus on Huffman compression, which is used in many compression programs, most notably WinZip.
• We’re just going to play with text files.

Text Files

• Each character is represented by one byte, a sequence of 8 bits (1’s and 0’s): the ASCII code.
• ASCII is an international standard for how a character is represented.

  A  01000001
  B  01000010
  ~  01111110
  3  00110011

• Most text files use fewer than 128 characters; this code has room for 256. Extra information!!
• Goal: use shorter codes to represent more frequent characters.
• You have seen this before…

Morse Code

  A .-      J .---    S ...     1 .----
  B -...    K -.-     T -       2 ..---
  C -.-.    L .-..    U ..-     3 ...--
  D -..     M --      V ...-    4 ....-
  E .       N -.      W .--     5 .....
  F ..-.    O ---     X -..-    6 -....
  G --.     P .--.    Y -.--    7 --...
  H ....    Q --.-    Z --..    8 ---..
  I ..      R .-.     0 -----   9 ----.

  Full stop .-.-.-   Comma --..--   Query ..--..

Example

DANIEL LEWIS DREIBELBIS

Break down the frequencies of each letter:

  Letter  Frequency     Letter  Frequency
  A       1             L       3
  B       2             N       1
  D       2             R       1
  E       4             S       2
  I       4             W       1

Example

Now assign short codes to the letters that occur most often:

  Letter  Frequency  Code  # of Digits
  E       4          0     4
  I       4          1     4
  L       3          00    6
  D       2          01    4
  S       2          10    4
  B       2          11    4
  A       1          000   3
  N       1          001   3
  R       1          010   3
  W       1          011   3

Total digits: 38. The usual way: 21 × 8 = 168. Compression: 22.6% of the original size, or a 77.4% decrease.

My name is now:

  010000011000 000011110 01010011100011110

But read back, those same bits could just as well spell:

  RAWA AWIS RINBABBE

That didn’t work. If we do this, we need a way to know when a letter stops. (Morse code solves this with pauses between letters; a binary file has no pauses.) Huffman coding provides this, though we’ll lose some compression.

Huffman Coding

Named after David Huffman, who published the method in 1952. Use a tree to construct the code, and then use the tree to interpret the code.

Huffman Chart

Repeatedly merge the two nodes with the lowest frequencies until only one node is left:

  E4  I4  L3  D2  S2  B2  A1  N1  R1  W1
  A1 + N1 → AN2
  R1 + W1 → RW2
  AN2 + RW2 → ANRW4
  S2 + B2 → SB4
  L3 + D2 → LD5
  SB4 + ANRW4 → SBANRW8
  E4 + I4 → EI8
  EI8 + LD5 → EILD13
  EILD13 + SBANRW8 → EILDSBANRW21

Each merge becomes a node of the tree, and a letter’s code is the path from the root down to it (0 for one branch, 1 for the other). Because every letter sits at a leaf, no code is a prefix of another, so we always know when a letter stops.
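The chart above is the whole algorithm: keep merging the two lowest counts, then read each letter’s code off the tree. As a rough illustration, here is a short Python sketch of that procedure (the name huffman_codes and the trick of carrying each letter’s partial code in a dictionary are my own, not from the slides; ties between equal frequencies may break differently than in the chart, so individual codes can differ from the slide’s, though the total length is still optimal):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code by repeatedly merging the two
    lowest-frequency nodes, as in the chart above."""
    freq = Counter(text)
    # Heap entries: (frequency, tiebreaker, {letter: code so far}).
    heap = [(f, i, {ch: ""}) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, low = heapq.heappop(heap)    # lowest frequency
        f2, _, nxt = heapq.heappop(heap)    # second lowest
        # Merging prepends one bit: 0 for one subtree, 1 for the other.
        merged = {ch: "0" + code for ch, code in low.items()}
        merged.update({ch: "1" + code for ch, code in nxt.items()})
        tiebreak += 1
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
    return heap[0][2]

name = "DANIELLEWISDREIBELBIS"   # spaces dropped, as on the slides
codes = huffman_codes(name)
bits = "".join(codes[ch] for ch in name)
print(len(bits), "bits, versus", 8 * len(name), "in plain ASCII")

# Because no code is a prefix of another, decoding needs no
# separators: emit a letter as soon as the buffered bits match a code.
decode = {code: ch for ch, code in codes.items()}
buffer, letters = "", []
for bit in bits:
    buffer += bit
    if buffer in decode:
        letters.append(decode[buffer])
        buffer = ""
print("".join(letters) == name)   # True
```

The decoding loop is the fix for the earlier failed attempt: with a prefix-free code, the bit stream itself tells us when each letter stops.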
Issues and Problems

What if all the frequencies were the same?

  AAAABBBBCCCCDDDDEEEEFFFFGGGG

or

  ABCDEFGABCDEFGABCDEFGABCDEFG

Huffman can’t do much if all the frequencies are the same, or if they are all similar (it can’t give preference to higher-occurring characters).

Issues and Problems

If there are patterns, you can use a different compression scheme to get rid of the patterns, then use Huffman coding to reduce the rest (this is what WinZip does). If your text is random, then you are out of luck:

  FDSAFSDFASFASDFSFDADFSFFASFDDFASFASFDSFDS

Also, the method requires you to pass through the text twice, which is time consuming. Adaptive Huffman coding builds the frequencies as you move along and changes the tree as new information comes in.

What’s the best you can do?

• Obviously, there is a limit to how far down you can compress a file.
• Assume your file has n different characters in it, say a1…an, each with probability p1…pn (so p1 + p2 + … + pn = 1).
• The entropy of the file is defined to be H = −(p1·log2(p1) + p2·log2(p2) + … + pn·log2(pn)).
• Entropy measures the least number of bits, on average, needed to represent a character.
• For my name, the entropy is 3.12 (it takes at least 3.12 bits per character to represent my name). Huffman gave an average of 3.19 bits per character.
• Huffman compression will always give an average that is within one bit of the entropy.
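To double-check those last two bullets, the entropy is easy to compute directly. A minimal sketch (the helper name entropy is mine, not from the slides):

```python
from collections import Counter
from math import log2

def entropy(text):
    """H = -(p1*log2(p1) + ... + pn*log2(pn)) for the letter
    frequencies of `text`: a lower bound on the average number
    of bits per character for any lossless code."""
    n = len(text)
    return -sum((f / n) * log2(f / n) for f in Counter(text).values())

name = "DANIELLEWISDREIBELBIS"   # spaces dropped, as on the slides
print(round(entropy(name), 2))   # 3.12
```

Counting depths in the chart’s tree, the Huffman code spends 67 bits on the 21 letters, an average of 67/21 ≈ 3.19 bits per character, which indeed lands between H ≈ 3.12 and H + 1 as the last bullet promises.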