From bits to bytes to ints At some level everything is stored as either a zero or a one A bit is a binary digit a byte is a binary term (8 bits) We should be grateful we can deal with Strings rather than sequences of 0's and 1's. We should be grateful we can deal with an int rather than the 32 bits that make an int Int values are stored as two's complement numbers with 32 bits, for 64 bits use the type long, a char is 16 bits Standard in Java, different in C/C++ Facilitates addition/subtraction for int values We don't need to worry about this, except to note: • Integer.MAX_VALUE + 1 = Integer.MIN_VALUE • Math.abs(Integer.MIN_VALUE) != Infinity CPS 100 12.1 How are data stored? To facilitate Huffman coding we need to read/write one bit Why do we need to read one bit? Why do we need to write one bit? When do we read 8 bits at a time? Read 32 bits at a time? We can't actually write one bit-at-a-time. We can't really write one char at a time either. Output and input are buffered,minimize memory accesses and disk accesses Why do we care about this when we talk about data structures and algorithms? • Where does data come from? CPS 100 12.2 How do we buffer char output? Done for us as part of InputStream and Reader classes InputStreams are for reading bytes Readers are for reading char values Why do we have both and how do they interact? Reader r = new InputStreamReader(System.in); Do we need to flush our buffers? In the past Java IO has been notoriously slow Do we care about I? About O? This is changing, and the java.nio classes help • Map a file to a region in memory in one operation CPS 100 12.3 Buffer bit output To buffer bit output we need to store bits in a buffer When the buffer is full, we write it. The buffer might overflow, e.g., in process of writing 10 bits to 32-bit capacity buffer that has 29 bits in it How do we access bits, add to buffer, etc.? We need to use bit operations Mask bits -- access individual bits Shift bits – to the left or to the right Bitwise and/or/negate bits CPS 100 12.4 Representing pixels A pixel typically stores RGB and alpha/transparency values Each RGB is a value in the range 0 to 255 The alpha value is also in range 0 to 255 Pixel red = new Pixel(255,0,0,0); Pixel white = new Pixel(255,255,255,0); Typically store these values as int values, a picture is simply an array of int values void process(int pixel){ int blue = pixel & 0xff; int green = (pixel >> 8) & 0xff; int red = (pixel>> 16) & 0xff; } CPS 100 12.5 Bit masks and shifts void process(int pixel){ int blue = pixel & 0xff; int green = (pixel >> 8) & 0xff; int red = (pixel >> 16) & 0xff; } Hexadecimal number: 0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f Note that f is 15, in binary this is 1111, one less than 10000 The hex number 0xff is an 8 bit number, all ones The bitwise & operator creates an 8 bit value, 0—255 (why) 1&1 == 1, otherwise we get 0, similar to logical and Similarly we have |, bitwise or CPS 100 12.6 Bit operations revisited How do we write out all of the bits of a number /** * writes the bit representation of a int * to standard out */ void bits(int val) { CPS 100 12.7 Text Compression Input: String S Output: String S Shorter S can be reconstructed from S CPS 100 12.8 Text Compression: Examples “abcde” in the different formats ASCII: 01100001011000100110001101100100… Fixed: 000001010011100 1 Encodings ASCII: 8 bits/character Unicode: 16 bits/character 0 Symbol ASCII Fixed length Var. length a 01100001 000 000 b 01100010 001 11 c 01100011 010 01 d 01100100 011 001 e 01100101 100 10 0 0 a Var: 000110100110 1 1 b 0 1 c d 0 0 e 1 0 0 0 CPS 100 0 a 1 1 d c 0 e 1 b 12.9 Huffman coding: go go gophers ASCII g 103 o 111 p 112 h 104 e 101 r 114 s 115 sp. 32 3 bits Huffman 1100111 1101111 1110000 1101000 1100101 1110010 1110011 1000000 000 001 010 011 100 101 110 111 00 01 1100 1101 1110 1111 101 101 13 7 6 g o 3 3 Encoding uses tree: 0 left/1 right How many bits? 37!! Savings? Worth it? 4 3 s * 1 2 2 2 p h e r 1 1 1 1 6 4 2 2 g o 3 3 3 CPS 100 p h e r 1 1 1 1 s * 1 2 12.10 Huffman Coding D.A Huffman in early 1950’s Before compressing data, analyze the input stream Represent data using variable length codes Variable length codes though Prefix codes Each letter is assigned a codeword Codeword is for a given letter is produced by traversing the Huffman tree Property: No codeword produced is the prefix of another Letters appearing frequently have short codewords, while those that appear rarely have longer ones Huffman coding is optimal per-character coding method CPS 100 12.11 Building a Huffman tree Begin with a forest of single-node trees (leaves) Each node/tree/leaf is weighted with character count Node stores two values: character and count There are n nodes in forest, n is size of alphabet? Repeat until there is only one node left: root of tree Remove two minimally weighted trees from forest Create new tree with minimal trees as children, • New tree root's weight: sum of children (character ignored) Does this process terminate? How do we get minimal trees? Remove minimal trees, hummm…… CPS 100 12.12 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 11 6 5 5 4 4 3 3 3 3 2 2 2 2 2 1 1 I E N S M A B O T G D L R U P F CPS 100 1 12.13 C Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 2 11 1 1 F C 6 5 5 4 4 3 3 3 3 I E N S M A B O T CPS 100 2 2 2 2 2 2 1 G D L R U P 12.14 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 2 11 3 1 1 1 2 C F P U 6 5 5 4 4 I E N S M CPS 100 3 3 3 3 3 A B O T 2 2 2 2 2 G D L R 12.15 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 4 2 11 3 1 1 1 2 C F P U 6 5 5 I E N CPS 100 4 4 4 S M 3 2 2 L R 2 3 3 3 3 A B O T 2 2 G D 12.16 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 4 2 11 3 1 1 1 2 C F P U 6 5 5 I E N CPS 100 4 4 4 4 S M 3 4 2 2 2 2 G D L R 3 3 3 3 A B O T 2 12.17 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 5 3 4 2 3 O 11 6 1 1 1 2 C F P U CPS 100 I 5 5 5 E N 4 4 4 4 S M 4 2 2 2 2 G D L R 3 3 3 A B T 3 12.18 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 5 3 6 2 3 O 11 6 4 3 T 1 1 1 2 C F P U CPS 100 6 I 5 5 5 E N 4 4 4 2 2 2 2 G D L R 4 4 3 3 S M A B 12.19 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 6 5 3 6 2 3 O 11 6 4 3 T 1 1 1 2 C F P U CPS 100 6 6 I 5 5 5 E N 4 4 4 2 2 2 2 G D L R 4 4 S M 3 3 A B 12.20 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 6 8 5 3 6 2 3 O 11 8 4 4 S M 4 3 T 1 1 1 2 C F P U CPS 100 6 6 6 I 5 5 5 E N 4 4 2 2 2 2 G D L R 4 3 3 A B 12.21 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 8 5 3 6 2 3 O 11 8 4 4 S M 4 3 T 1 1 1 2 C F P U CPS 100 8 6 6 6 I 5 6 8 5 5 E N 4 2 2 2 2 G D L R 3 3 A B 12.22 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 10 5 8 5 6 E 3 2 3 O 11 10 4 4 S M 4 3 T 1 1 1 2 C F P U CPS 100 8 8 6 6 6 I 5 6 8 5 N 4 2 2 2 2 G D L R 3 3 A B 12.23 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 10 5 11 5 5 E 6 N 3 2 3 O 11 8 11 3 T 1 1 1 2 C F P U 10 CPS 100 8 8 6 6 I 6 8 4 4 S M 4 4 2 2 2 2 G D L R 3 3 A B 12.24 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 12 10 11 8 6 8 6 I 5 5 5 E N 3 2 3 O 12 6 11 3 T 1 1 1 2 C F P U 11 CPS 100 10 8 8 4 4 S M 4 4 2 2 2 2 G D L R 3 3 A B 12.25 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 16 10 11 12 8 6 8 6 I 5 5 5 E N 3 2 3 O 16 6 12 3 T 1 1 1 2 C F P U 11 CPS 100 11 10 4 4 S M 4 4 2 2 2 2 G D L R 3 3 A B 12.26 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 21 16 10 11 12 8 6 8 6 I 5 5 5 E N 3 2 3 O 21 6 16 3 T 1 1 1 2 C F P U 12 CPS 100 11 4 4 S M 4 4 2 2 2 2 G D L R 3 3 A B 12.27 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E N 3 2 3 O 23 6 21 3 T 1 1 1 2 C F P U 16 CPS 100 4 4 S M 4 4 2 2 2 2 G D L R 3 3 A B 12.28 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E N 3 2 3 O 37 6 23 3 T 1 1 1 2 C F P U CPS 100 4 4 S M 4 4 2 2 2 2 G D L R 3 3 A B 12.29 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E N 3 2 3 O 60 6 3 T 1 1 1 2 C F P U CPS 100 4 4 S M 4 4 2 2 2 2 G D L R 3 3 A B 12.30 Huffman Complexities How do we measure? Size of input file, size of alphabet Which is typically bigger? Accumulating character counts: ______ How can we do this in O(1) time, though not really Building the heap/priority queue from counts ____ Initializing heap guaranteed Building Huffman tree ____ Why? Create table of encodings from tree ____ Why? Write tree and compressed file _____ CPS 100 12.31 Properties of Huffman coding Want to minimize weighted path length L(T)of tree T L(T ) d w i iLeaf ( T ) i wi is the weight or count of each codeword i di is the leaf corresponding to codeword i How do we calculate character (codeword) frequencies? Huffman coding creates pretty full bushy trees? When would it produce a “bad” tree? How do we produce coded compressed data from input efficiently? CPS 100 12.32 Writing code out to file How do we go from characters to encodings? Build Huffman tree Root-to-leaf path generates encoding Need way of writing bits out to file Platform dependent? Complicated to write bits and read in same ordering See BitInputStream and BitOutputStream classes Depend on each other, bit ordering preserved How do we know bits come from compressed file? Store a magic number CPS 100 12.33 Decoding a message 01100000100001001101 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CPS 100 3 T 1 1 1 2 C F P U 4 4 S M 4 4 2 2 2 2 G D L R 3 3 A B 12.34 Decoding a message 1100000100001001101 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CPS 100 3 T 1 1 1 2 C F P U 4 4 S M 4 4 2 2 2 2 G D L R 3 3 A B 12.35 Decoding a message 100000100001001101 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CPS 100 3 T 1 1 1 2 C F P U 4 4 S M 4 4 2 2 2 2 G D L R 3 3 A B 12.36 Decoding a message 00000100001001101 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CPS 100 3 T 1 1 1 2 C F P U 4 4 S M 4 4 2 2 2 2 G D L R 3 3 A B 12.37 Decoding a message 0000100001001101 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CPS 100 4 4 S M 3 T 1 1 1 2 C F P U G 4 4 2 2 2 2 G D L R 3 3 A B 12.38 Decoding a message 000100001001101 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CPS 100 4 4 S M 3 T 1 1 1 2 C F P U G 4 4 2 2 2 2 G D L R 3 3 A B 12.39 Decoding a message 00100001001101 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CPS 100 4 4 S M 3 T 1 1 1 2 C F P U G 4 4 2 2 2 2 G D L R 3 3 A B 12.40 Decoding a message 0100001001101 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CPS 100 4 4 S M 3 T 1 1 1 2 C F P U G 4 4 2 2 2 2 G D L R 3 3 A B 12.41 Decoding a message 100001001101 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CPS 100 4 4 S M 3 T 1 1 1 2 C F P U G 4 4 2 2 2 2 G D L R 3 3 A B 12.42 Decoding a message 00001001101 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CPS 100 4 4 S M 3 T 1 1 1 2 C F P U GO 4 4 2 2 2 2 G D L R 3 3 A B 12.43 Decoding a message 0001001101 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CPS 100 4 4 S M 3 T 1 1 1 2 C F P U GO 4 4 2 2 2 2 G D L R 3 3 A B 12.44 Decoding a message 001001101 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CPS 100 4 4 S M 3 T 1 1 1 2 C F P U GO 4 4 2 2 2 2 G D L R 3 3 A B 12.45 Decoding a message 01001101 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CPS 100 4 4 S M 3 T 1 1 1 2 C F P U GO 4 4 2 2 2 2 G D L R 3 3 A B 12.46 Decoding a message 1001101 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CPS 100 4 4 S M 3 T 1 1 1 2 C F P U GO 4 4 2 2 2 2 G D L R 3 3 A B 12.47 Decoding a message 001101 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CPS 100 4 4 S M 3 T 1 1 1 2 C F P U GOO 4 4 2 2 2 2 G D L R 3 3 A B 12.48 Decoding a message 01101 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CPS 100 4 4 S M 3 T 1 1 1 2 C F P U GOO 4 4 2 2 2 2 G D L R 3 3 A B 12.49 Decoding a message 1101 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CPS 100 4 4 S M 3 T 1 1 1 2 C F P U GOO 4 4 2 2 2 2 G D L R 3 3 A B 12.50 Decoding a message 101 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CPS 100 4 4 S M 3 T 1 1 1 2 C F P U GOO 4 4 2 2 2 2 G D L R 3 3 A B 12.51 Decoding a message 01 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CPS 100 4 4 S M 3 T 1 1 1 2 C F P U GOO 4 4 2 2 2 2 G D L R 3 3 A B 12.52 Decoding a message 1 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CPS 100 3 T 1 1 1 2 C F P U 4 4 S M 4 4 2 2 2 2 G D L R GOOD 3 3 A B 12.53 Decoding a message 01100000100001001101 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CPS 100 3 T 1 1 1 2 C F P U 4 4 S M 4 4 2 2 2 2 G D L R GOOD 3 3 A B 12.54 Decoding 1. Read in tree data O( ) 2. Decode bit string with tree O( ) CPS 100 12.55 Huffman coding: go go gophers ASCII g 103 o 111 p 112 h 104 e 101 r 114 s 115 sp. 32 3 bits Huffman 1100111 1101111 1110000 1101000 1100101 1110010 1110011 1000000 000 001 010 011 100 101 110 111 ?? ?? choose two smallest weights combine nodes + weights Repeat Priority queue? Encoding uses tree: 0 left/1 right How many bits? CPS 100 g o p h e r s * 3 3 1 1 1 1 1 2 2 2 3 p h e r s * 1 1 1 1 1 2 6 4 2 2 p h e r 1 1 1 1 g o 3 3 12.56 Huffman Tree 2 “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” E.g. “ A SIMPLE” “10101101001000101001110011100000” CPS 100 12.57 Huffman Tree 2 “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” E.g. “ A SIMPLE” “10101101001000101001110011100000” CPS 100 12.58 Huffman Tree 2 “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” E.g. “ A SIMPLE” “10101101001000101001110011100000” CPS 100 12.59 Huffman Tree 2 “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” E.g. “ A SIMPLE” “10101101001000101001110011100000” CPS 100 12.60 Huffman Tree 2 “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” E.g. “ A SIMPLE” “10101101001000101001110011100000” CPS 100 12.61 Huffman Tree 2 “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” E.g. “ A SIMPLE” “10101101001000101001110011100000” CPS 100 12.62 Huffman Tree 2 “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” E.g. “ A SIMPLE” “10101101001000101001110011100000” CPS 100 12.63 Huffman Tree 2 “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” E.g. “ A SIMPLE” “10101101001000101001110011100000” CPS 100 12.64 Huffman Tree 2 “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” E.g. “ A SIMPLE” “10101101001000101001110011100000” CPS 100 12.65 Other methods Adaptive Huffman coding Lempel-Ziv algorithms Build the coding table on the fly while reading document Coding table changes dynamically Protocol between encoder and decoder so that everyone is always using the right coding scheme Works well in practice (compress, gzip, etc.) More complicated methods Burrows-Wheeler (bunzip2) PPM statistical methods CPS 100 12.66