Text Compression
  Input: String S
  Output: String S'
    Shorter
    S can be reconstructed from S'
CompSci 100

Text Compression: Examples
  Encodings
    ASCII: 8 bits/character
    Unicode: 16 bits/character
  “abcde” in the different formats
    ASCII: 01100001011000100110001101100100…
    Fixed: 000001010011100
    Var:   000110100110

  Symbol  ASCII     Fixed length  Var. length
  a       01100001  000           000
  b       01100010  001           11
  c       01100011  010           01
  d       01100100  011           001
  e       01100101  100           10

  [Figures: two binary code trees with edges labeled 0 and 1 and leaves a to e, one for the fixed-length code and one for the variable-length code]

Huffman coding: go go gophers

  Symbol  ASCII  ASCII (7 bits)  3 bits  Huffman
  g       103    1100111         000     00
  o       111    1101111         001     01
  p       112    1110000         010     1100
  h       104    1101000         011     1101
  e       101    1100101         100     1110
  r       114    1110010         101     1111
  s       115    1110011         110     100
  sp.     32     0100000         111     101

  Encoding uses tree: 0 left/1 right
  How many bits? 37!!
  Savings? Worth it?
  [Figures: the Huffman tree for “go go gophers”, shown at two stages of construction; leaf weights g 3, o 3, p 1, h 1, e 1, r 1, s 1, space (*) 2; root weight 13 with subtrees of weight 7 and 6]

Huffman Coding
  D. A. Huffman, early 1950s
  Before compressing data, analyze the input stream
  Represent data using variable-length codes
  Variable-length codes through prefix codes
    Each letter is assigned a codeword
    The codeword for a given letter is produced by traversing the Huffman tree
    Property: no codeword produced is the prefix of another
    Letters appearing frequently have short codewords, while those that appear rarely have longer ones
  Huffman coding is an optimal per-character coding method

Building a Huffman tree
  Begin with a forest of single-node trees (leaves)
    Each node/tree/leaf is weighted with character count
    Node stores two values: character and count
    There are n nodes in forest, n is size of alphabet?
  Repeat until there is only one node left: root of tree
    Remove two minimally weighted trees from forest
    Create new tree with minimal trees as children
      New tree root's weight: sum of children (character ignored)
  Does this process terminate? How do we get minimal trees?
  Remove minimal trees, hmmm……

Building a tree
  “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
  Character counts (the unlabeled 11 is the space character):
    space 11, I 6, E 5, N 5, S 4, M 4, A 3, B 3, O 3, T 3, G 2, D 2, L 2, R 2, U 2, P 1, F 1, C 1
  [Figures: one slide per merge of the two lightest trees. New roots so far, in order: 2 (F + C), 3 (P + U), 4 (L + R), 4 (G + D), 5 (F/C subtree + O), 6 (P/U subtree + T), 6 (A + B), 8 (S + M), 8 (G/D subtree + L/R subtree), 10 (weight-5 subtree + E)]
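The merging shown in these slides can be reproduced with a priority queue. What follows is a minimal Python sketch, not the course's implementation: the function name `huffman_merges`, the nested-tuple tree representation and the tie-breaking counter are all assumptions made for illustration. It counts characters, repeatedly removes the two lightest trees from a binary heap, and records the weight of each new root.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_merges(text):
    """Build a Huffman tree for a non-empty text.

    Returns (tree, root_weight, merged). The tree is a nested tuple
    (left, right) with single characters at the leaves; merged lists the
    weight of each new root in the order it was created.
    """
    tiebreak = count()  # unique sequence numbers, so the heap never compares trees
    heap = [(weight, next(tiebreak), symbol)
            for symbol, weight in Counter(text).items()]
    heapq.heapify(heap)

    merged = []
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)   # lightest tree
        w2, _, t2 = heapq.heappop(heap)   # second lightest tree
        merged.append(w1 + w2)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (t1, t2)))

    root_weight, _, tree = heap[0]
    return tree, root_weight, merged


text = "A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS"
tree, root_weight, merged = huffman_merges(text)
print(root_weight)   # total character count
print(merged)        # weights of the internal nodes, in creation order
print(sum(merged))   # total encoded length in bits
```

For the example string this should report a root weight of 60 and the new-root weights 2, 3, 4, 4, 5, 6, 6, 8, 8, 10, 11, 12, 16, 21, 23, 37, 60. Which equal-weight trees get paired may differ from the slides, since ties can be broken either way, but the weights do not. Their sum, 236, is the length of the encoded message in bits, compared with 300 bits for a fixed 5-bit code over the 18 distinct characters.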
  [Figures, continued: the remaining merges create roots 11 (N + P/U/T subtree), 12 (I + A/B subtree), 16 (8 + 8), 21 (10 + 11), 23 (space + 12), 37 (16 + 21) and finally 60 (37 + 23), the root of the complete tree]

Encoding
  1. Count occurrence of all occurring characters    O( N )
  2. Build priority queue                            O( A )
  3. Build Huffman tree                              O( A log A )
  4. Create table of codes from tree                 O( A log A )
  5. Write Huffman tree and coded data to file       O( N )
  Here N is presumably the length of the input and A the size of the alphabet.

Properties of Huffman coding
  Want to minimize weighted path length L(T) of tree T
    $L(T) = \sum_{i \in \mathrm{Leaf}(T)} d_i \, w_i$
    $w_i$ is the weight or count of each codeword i
    $d_i$ is the depth of the leaf corresponding to codeword i
  How do we calculate character (codeword) frequencies?
  Huffman coding creates pretty full bushy trees?
    When would it produce a “bad” tree?
  How do we produce coded compressed data from input efficiently?

Writing code out to file
  How do we go from characters to encodings?
    Build Huffman tree
    Root-to-leaf path generates encoding
  Need way of writing bits out to file
    Platform dependent?
    Complicated to write bits and read in same ordering
  See BitInputStream and BitOutputStream classes
    Depend on each other, bit ordering preserved
  How do we know bits come from compressed file?
    Store a magic number

Decoding a message
  01100000100001001101
  [Figure: the completed Huffman tree (root weight 60). The slides that follow consume the bit string one bit at a time, walking down from the root and emitting a character whenever a leaf is reached.]
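The bit-by-bit walk can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the course's code: the exact codewords of the weight-60 tree are not recoverable from these figures, so the example uses the “go go gophers” code table from the earlier slide instead, and the names `GOPHER_CODES`, `build_tree` and `decode` are made up for the example. The tree is stored as nested dictionaries keyed by "0" (left) and "1" (right).

```python
# Code table taken from the "go go gophers" slide (Huffman column).
GOPHER_CODES = {
    "g": "00", "o": "01", "p": "1100", "h": "1101",
    "e": "1110", "r": "1111", "s": "100", " ": "101",
}


def build_tree(codes):
    """Rebuild the code tree from a {symbol: bit string} table."""
    root = {}
    for symbol, bits in codes.items():
        node = root
        for bit in bits[:-1]:
            node = node.setdefault(bit, {})   # create internal nodes as needed
        node[bits[-1]] = symbol               # last bit leads to the leaf
    return root


def decode(bits, root):
    """Walk the tree one bit at a time; emit a symbol at each leaf."""
    decoded = []
    node = root
    for bit in bits:
        node = node[bit]
        if isinstance(node, str):   # reached a leaf
            decoded.append(node)
            node = root             # restart at the root for the next codeword
    if node is not root:
        raise ValueError("bit string ends in the middle of a codeword")
    return "".join(decoded)


tree = build_tree(GOPHER_CODES)
message = "".join(GOPHER_CODES[ch] for ch in "go go gophers")
print(len(message))            # number of encoded bits
print(decode(message, tree))   # recovers the original text
```

Because no codeword is a prefix of another, the decoder never needs to look ahead or backtrack: reaching a leaf is an unambiguous signal that one character is complete.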
  [Figures, continued: one bit is consumed per slide while the same tree is walked; the decoded output grows from G to GO to GOO]
  [Figures, continued: the last bits are consumed and the decoded message reads GOOD]

Decoding
  1. Read in tree data                O(   )
  2. Decode bit string with tree      O(   )

Huffman coding: go go gophers

  Symbol  ASCII  ASCII (7 bits)  3 bits  Huffman
  g       103    1100111         000     ??
  o       111    1101111         001     ??
  p       112    1110000         010
  h       104    1101000         011
  e       101    1100101         100
  r       114    1110010         101
  s       115    1110011         110
  sp.     32     0100000         111

  choose two smallest weights
  combine nodes + weights
  Repeat
  Priority queue?
  Encoding uses tree: 0 left/1 right
  How many bits?
  [Figures: the forest of eight leaves (g 3, o 3, p 1, h 1, e 1, r 1, s 1, * 2) and the first merges: p + h = 2, e + r = 2, s + * = 3, then 2 + 2 = 4 and g + o = 6]

Huffman Tree 2
  “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
  E.g. “ A SIMPLE” “10101101001000101001110011100000”
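To make "root-to-leaf path generates encoding" and "How many bits?" concrete, here is a small Python sketch. It is an assumption-laden illustration rather than the course's code: `GOPHER_TREE` is a nested-tuple tree written out by hand so that it reproduces the Huffman column of the first “go go gophers” table, and `code_table` and `weighted_path_length` are hypothetical helper names. The second function evaluates the weighted path length, the sum over leaves of depth times count.

```python
from collections import Counter

# Hand-written tree: a pair is (left, right), a leaf is a one-character string.
GOPHER_TREE = (("g", "o"), (("s", " "), (("p", "h"), ("e", "r"))))


def code_table(tree, prefix=""):
    """Map each leaf symbol to its root-to-leaf path (0 = left, 1 = right)."""
    if isinstance(tree, str):
        return {tree: prefix or "0"}   # a lone leaf still needs one bit
    left, right = tree
    table = code_table(left, prefix + "0")
    table.update(code_table(right, prefix + "1"))
    return table


def weighted_path_length(text, table):
    """Sum of (codeword length * character count) over all characters."""
    counts = Counter(text)
    return sum(len(table[symbol]) * weight for symbol, weight in counts.items())


text = "go go gophers"
table = code_table(GOPHER_TREE)
huffman_bits = weighted_path_length(text, table)
fixed_bits = 3 * len(text)   # 3-bit fixed-length code
ascii_bits = 7 * len(text)   # 7-bit ASCII
print(table)
print(huffman_bits, fixed_bits, ascii_bits)
```

For this 13-character message the three totals come to 37, 39 and 91 bits. The saving over the 3-bit code is small, and it shrinks further once the tree itself has to be stored alongside the data.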
Other methods
  Adaptive Huffman coding
  Lempel-Ziv algorithms
    Build the coding table on the fly while reading document
    Coding table changes dynamically
    Protocol between encoder and decoder so that everyone is always using the right coding scheme
    Works well in practice (compress, gzip, etc.)
  More complicated methods
    Burrows-Wheeler (bunzip2)
    PPM statistical methods
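As an illustration of building the coding table on the fly, here is a minimal Python sketch of LZW, one member of the Lempel-Ziv family (the variant used by the Unix `compress` tool). The slides do not name a specific variant, so this choice is an assumption, as are the function names `lzw_compress` and `lzw_decompress`. The sketch emits a list of integer codes rather than packed bits and assumes every input character has a code point below 256.

```python
def lzw_compress(text):
    """Return a list of integer codes; the table grows as the text is read."""
    table = {chr(i): i for i in range(256)}   # start with all single characters
    next_code = 256
    current = ""
    codes = []
    for ch in text:
        candidate = current + ch
        if candidate in table:
            current = candidate                # keep extending the match
        else:
            codes.append(table[current])       # emit the longest known match
            table[candidate] = next_code       # learn the new string
            next_code += 1
            current = ch
    if current:
        codes.append(table[current])
    return codes


def lzw_decompress(codes):
    """Rebuild the text, reconstructing the same table one step behind."""
    if not codes:
        return ""
    table = {i: chr(i) for i in range(256)}
    next_code = 256
    previous = table[codes[0]]
    pieces = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == next_code:
            entry = previous + previous[0]     # code defined by this very step
        else:
            raise ValueError("invalid LZW code")
        pieces.append(entry)
        table[next_code] = previous + entry[0]
        next_code += 1
        previous = entry
    return "".join(pieces)


codes = lzw_compress("go go gophers")
print(codes)
print(lzw_decompress(codes))
```

In this variant the table is never transmitted: the decoder derives each entry from the codes it has already seen, which is one way to realize the protocol between encoder and decoder that the slide mentions.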