Bits and Huffman Encoding Please get a piece of paper and a pen and put your name and netid on it. Make sure you can turn in it after class without losing your notes. How is data stored in a computer? • Several different ways are possible • In an ASCII-encoded text file (hithero known as a “text file”), each character consists of 8bit chunks • What’s a bit? Bits Mean What You Want Them To Mean 01001101 01001001 01001011 01000101 M K I E But why 8-bit chunks? Couldn’t we interpret the same string as 4 bit chunks? 0100 1101 0100 1001 0100 1011 0100 0101 4 13 4 9 4 11 4 5 With 8 bit chucks there are 28 (256) possible characters. With 4 bit chunks there are 24 (16) possible characters. What if we want to save memory? • Could just use fewer bits per chunk, but that limits the number of characters we can represent • But my document contains a lot more ‘a’s than ‘%’s. What if the code for a was “01” and the code for % was “10000101” A Problem Mike decides he wants to compress his vast collection of Harry Potter fanfic (all stored a textfiles). He notices that the character ‘H’ occurs much more frequently in the textfiles than does the character ‘%’, but they both use 8 bits. So he changes the encoding so that ‘H’ is represented by ‘01’ (2 bits) and ‘%’ is represented by ‘1010101010101010’ (16 bits). Assuming all other characters stay at 8 bits (a bit unlikely…but just pretend)…how much more frequent do ‘H’s need to be than ‘%’s for there to be a space savings? 1. 2. 3. 4. There must be more than 3Hs for every %s There must be more than 3Hs for every 2%s There must be more than 4Hs for every 3%s As long as there more Hs than %s you get a savings • We can get savings if we know what characters occur frequently • So imagine you look at some set of data with character frequencies. How much can you save? Consider the string: • AACCCAAABAADAE Letter Frequency A 8 C 3 B 1 D 1 E 1 What’s the best encoding? Why will this encoding not work? Letter Frequency Encoding A 8 0 C 3 1 B 1 10 D 1 01 E 1 11 What is the encoding for AAE? What is the encoding for ADC? Consider the string: • AACCCAAABAADAE Letter Frequency Encoding A 8 ? C 3 ? B 1 ? D 1 ? E 1 ? The rule: one character’s encoding cannot be the prefix of another. So if A=01, no other encoding can begin with 01 Work with your row to find the most efficient possible encoding for each character. I was able to encode the entire string in 25 bits. See if you can do it that efficiently – but be careful with prefixes. Write your encoding, plus your computation of its total length, on your handin sheet. My Encoding • AACCCAAABAADAE Letter Frequency Encoding A 8 0 C 3 11 B 1 100 D 1 1010 E 1 1011 The rule: one character’s encoding cannot be the prefix of another. So if A=01, no other encoding can begin with 01 There a several possible encodings that can give you 25. I’m pretty sure there’s no more efficient encoding, at least as far as compressing individual characters. Given a set of frequencies, how should you select the character encoding for maximum efficiency? Huffman Encoding The basic idea • We will make a “huffman tree”. By repeatedly combining nodes with small frequencies into “meta nodes” with larger frequency • A letter’s position in the tree will tell us what its encoded form is Generate a huffman tree and encodings for this circumstance Letter2 Frequency Encoding A 12 ? C 9 ? B 8 ? D 8 ? E 5 ? F 5 ? Follow the tutorial linked off the resources section in Sakai Write the huffman tree and the resultant encoding on your handin sheet. The Header • The Magic Number • Counts • Or a tree for extra credit The Psuedo-EOF character • A made up character to write to your file Please turn your sheet in at the back of the room (try for a neat pile)